|    | L # | Hits | Search Text | DBs |
|----|-----|------|-------------|-----|
| 1  | L1  | 114  | (plural plurality multiple multiplicity several two second) adj2 (pc pa ip ((instruction program) adj2 (counter address))) near30 (multithread\$3 thread\$3 context) | USPAT; US-PGPUB |
| 2  | L2  | 2859 | (plural plurality multiple multiplicity several two second) adj2 (pc pa ip ((instruction program) adj2 (counter address))) and (multithread\$3 thread\$3 context) | USPAT; US-PGPUB |
| 3  | L3  | 683  | (select\$3 interleav\$3 execut\$3) near10 ((cycl\$6 robin) near20 (multithread\$3 thread\$3 context)) | USPAT; US-PGPUB |
| 4  | L4  | 137  | (fine adj2 grain\$3) near20 (multithread\$3 thread\$3 context) | USPAT; US-PGPUB |
| 5  | L5  | 86   | 2 and (3 4) | USPAT; US-PGPUB |
| 6  | L6  | 63   | 1 not 5 | USPAT; US-PGPUB |
| 7  | L7  | 7    | (plural plurality multiple multiplicity several two second) adj2 (pc pa ip ((instruction program) adj2 (counter address))) near30 (multithread\$3 thread\$3 context) | EPO; JPO; DERWENT; IBM_TDB |
| 8  | L8  | 22   | (plural plurality multiple multiplicity several two second) adj2 (pc pa ip ((instruction program) adj2 (counter address))) and (multithread\$3 thread\$3 context) | EPO; JPO; DERWENT; IBM_TDB |
| 9  | L9  | 61   | (select\$3 interleav\$3 execut\$3) near10 ((cycl\$6 robin) near20 (multithread\$3 thread\$3 context)) | EPO; JPO; DERWENT; IBM_TDB |
| 10 | L10 | 28   | (fine adj2 grain\$3) near20 (multithread\$3 thread\$3 context) | EPO; JPO; DERWENT; IBM_TDB |

FIG. 51 (labels: FRREC, FUNCTIONAL UNITS, DCACHE, LORAB/STK CACHE)

|    | Document ID | U | Title | Current OR |
|----|-------------|---|-------|------------|
| 1  | US 20040073778 A1 |   | Parallel processor architecture | 712/220 |
| 2  | US 20040054880 A1 | ⊠ | Microengine for parallel processor architecture | 712/245 |
| 3  | US 20030189565 A1 | ⊠ | Single semiconductor graphics platform system and method with skinning, swizzling and masking capabilities | 345/418 |
| 4  | US 20030163669 A1 | ⊠ | Configuration of multi-cluster processor from single wide thread to two half-width threads | 712/24 |
| 5  | US 20030149888 A1 | ⊠ | Integrated network intrusion detection | 713/200 |
| 6  | US 20030145159 A1 | ⊠ | SRAM controller for parallel processor architecture | 711/104 |
| 7  | US 20030112246 A1 | ⊠ | Blending system and method in an integrated computer graphics pipeline | 345/519 |
| 8  | US 20030112245 A1 | ⊠ | Single semiconductor graphics platform | 345/506 |
| 9  | US 20030110366 A1 | ⊠ | Run-ahead program execution with value prediction | 712/225 |
| 10 | US 20030105901 A1 | ⊠ | PARALLEL MULTI-THREADED PROCESSING | 710/240 |
| 11 | US 20030105620 A1 | ⊠ | System, method and article of manufacture for interface constructs in a programming language capable of programming hardware architetures | 703/22 |
| 12 | US 20030103054 A1 | ⊠ | Integrated graphics processing unit with antialiasing | 345/506 |
| 13 | US 20030103050 A1 | ⊠ | Masking system and method for a graphics processing framework embodied on a single semiconductor platform | 345/426 |
| 14 | US 20030097548 A1 | ⊠ | Context execution in pipelined computer processor | 712/228 |
| 15 | US 20030074177 A1 | ⊠ | System, method and article of manufacture for a simulator plug-in for co-simulation purposes | 703/22 |
| 16 | US 20030046671 A1 | ⊠ | System, method and article of manufacture for signal constructs in a programming language capable of programming hardware architectures | 717/141 |
| 17 | US 20030046668 A1 | ⊠ | System, method and article of manufacture for distributing IP cores | 717/131 |



Sheet 21 of 45

|    | Document ID | U | Title | Current OR |
|----|-------------|---|-------|------------|
| 18 | US 20030038808 A1 | ⊠ | Method, apparatus and article of manufacture for a sequencer in a transform/lighting module capable of processing multiple independent execution threads | 345/506 |
| 19 | US 20030037321 A1 | ⊠ | System, method and article of manufacture for extensions in a programming lanauage capable of programming hardware architectures | 717/149 |
| 20 | US 20030034975 A1 | ⊠ | Lighting system and method for a graphics processor | 345/426 |
| 21 | US 20030033594 A1 | ⊠ | System, method and article of manufacture for parameterized expression libraries | 717/141 |
| 22 | US 20030033588 A1 | ⊠ | System, method and article of manufacture for using a library map to create and maintain IP cores effectively | 717/107 |
| 23 | US 20030028864 A1 | ⊠ | System, method and article of manufacture for successive compilations using incomplete parameters | 717/141 |
| 24 | US 20030020720 A1 | ⊠ | Method, apparatus and article of manufacture for a sequencer in a transform/lighting module capable of processing multiple independent execution threads | 345/506 |
| 25 | US 20030005262 A1 | ⊠ | Mechanism for providing high instruction fetch bandwidth in a multi-threaded processor | 712/207 |
| 26 | US 20020199173 A1 | ⊠ | System, method and article of manufacture for a debugger capable of operating across multiple threads and lock domains | 717/129 |
| 27 | US 20020196259 A1 | ⊠ | Single semiconductor graphics platform with blending and fog capabilities | 345/506 |
| 28 | US 20020180740 A1 | ⊠ | Clipping system and method for a single graphics semiconductor platform | 345/506 |
| 29 | US 20020129227 A1 | ⊠ | Processor having priority changing function according to threads | 712/228 |
| 30 | US 20020105519 A1 | ⊠ | Clipping system and method for a graphics processing framework embodied on a single semiconductor platform | 345/426 |
| 31 | US 20020091915 A1 |   | Load prediction and thread identification in a multithreaded microprocessor | 712/225 |
| 32 | US 20020078121 A1 | ⊠ | Real-time scheduler | 718/102 |
| 33 | US 20020066005 A1 | ⊠ | Data processor with an improved data dependence detector | 712/218 |
| 34 | US 20020056037 A1 | ⊠ | Method and apparatus for providing large register address space while maximizing cycletime performance for a multi-threaded register file set | 712/215 |

Jan. 26, 1999 Sheet 22 of 45



**FIG. 28**



FIG. 29

|    | Document ID | U | Title | Current OR |
|----|-------------|---|-------|------------|
| 35 | US 20020047846 A1 | ⊠ | System, method and computer program product for performing a scissor operation in a graphics processing framework embodied on a single semiconductor platform | 345/522 |
| 36 | US 20020046325 A1 | ⊠ | Buffer memory management in a system having multiple execution entities | 711/122 |
| 37 | US 20020027553 A1 | ⊠ | Diffuse-coloring system and method for a graphics processing framework embodied on a single semiconductor platform | 345/426 |
| 38 | US 20020013861 A1 | ⊠ | Method and apparatus for low overhead multithreaded communication in a parallel processing environment | 719/313 |
| 39 | US 20020010793 A1 | ⊠ | METHOD AND APPARATUS FOR PERFORMING FRAME PROCESSING FOR A NETWORK | 709/240 |
| 40 | US 20010056456 A1 | ⊠ | PRIORITY BASED SIMULTANEOUS MULTI-THREADING | 718/103 |
| 41 | US 20010052053 A1 | ⊠ | Stream processing unit for a multi-streaming processor | 711/138 |
| 42 | US 20010049770 A1 | ⊠ | BUFFER MEMORY MANAGEMENT IN A SYSTEM HAVING MULTIPLE EXECUTION ENTITIES | 711/129 |
| 43 | US 20010047468 A1 | ⊠ | Branch and return on blocked load or store | 712/228 |
| 44 | US 20010037445 A1 | ⊠ | Cycle count replication in a simultaneous and redundantly threaded processor | 712/216 |
| 45 | US 20010017626 A1 | ⊠ | Graphics processing unit with transform module capable of handling scalars and vectors | 345/501 |
| 46 | US 20010005209 A1 |   | Method, apparatus and article of manufacture for a transform module in a graphics processor | 345/506 |
| 47 | US 6691301 B2 | ⊠ | System, method and article of manufacture for signal constructs in a programming language capable of programming hardware architectures | 717/114 |
| 48 | US 6668317 B1 | ⊠ | Microengine for parallel processor architecture | 712/245 |
| 49 | US 6658447 B2 | ⊠ | Priority based simultaneous multi-threading | 718/103 |
| 50 | US 6650331 B2 | ⊠ | System, method and computer program product for performing a scissor operation in a graphics processing framework embodied on a single semiconductor platform | 345/522 |
| 51 | US 6650330 B2 | ⊠ | Graphics system and method for processing multiple independent execution threads | 345/506 |
| 52 | US 6650325 B1 | ⊠ | Method, apparatus and article of manufacture for boustrophedonic rasterization | 345/426 |
| 53 | US 6625654 B1 | ⊠ | Thread signaling in multi-threaded network processor | 709/230 |




**FIG. 30**



**FIG. 31** 

|    | Document ID | U | Title | Current OR |
|----|-------------|---|-------|------------|
| 54 | US 6606704 B1 | ⊠ | Parallel multithreaded processor with plural microengines executing multiple threads each microengine having loadable microcode | 712/248 |
| 55 | US 6597356 B1 | ⊠ | Integrated tessellator in a graphics processing unit | 345/423 |
| 56 | US 6587906 B2 | ⊠ | Parallel multi-threaded processing | 710/240 |
| 57 | US 6578137 B2 | ⊠ | Branch and return on blocked load or store | 712/228 |
| 58 | US 6577309 B2 | ⊠ | System and method for a graphics processing framework embodied utilizing a single semiconductor platform | 345/426 |
| 59 | US 6573900 B1 | ⊠ | Method, apparatus and article of manufacture for a sequencer in a transform/lighting module capable of processing multiple independent execution threads | 345/537 |
| 60 | US 6564267 B1 | ⊠ | Network adapter with large frame transfer emulation | 709/250 |
| 61 | US 6532509 B1 | ⊠ | Arbitrating command requests in a parallel multi-threaded processing system | 710/240 |
| 62 | US 6515671 B1 | ⊠ | Method, apparatus and article of manufacture for a vertex attribute buffer in a graphics processor | 345/506 |
| 63 | US 6504542 B1 | ⊠ | Method, apparatus and article of manufacture for area rasterization using sense points | 345/441 |
| 64 | US 6470443 B1 | ⊠ | Pipelined multi-thread processor selecting thread instruction in inter-stage buffer based on count information | 712/205 |
| 65 | US 6470422 B2 | ⊠ | Buffer memory management in a system having multiple execution entities | 711/129 |
| 66 | US 6462737 B2 | ⊠ | Clipping system and method for a graphics processing framework embodied on a single semiconductor platform | 345/426 |
| 67 | US 6452595 B1 | ⊠ | Integrated graphics processing unit with antialiasing | 345/426 |
| 68 | US 6427196 B1 | ⊠ | SRAM controller for parallel processor architecture including address and command queue and arbiter | 711/158 |
| 69 | US 6417851 B1 | ⊠ | Method and apparatus for lighting module in a graphics processor | 345/426 |
| 70 | US 6377998 B2 | ⊠ | Method and apparatus for performing frame processing for a network | 709/236 |
| 71 | US 6374367 B1 | ⊠ | Apparatus and method for monitoring a computer system to guide optimization | 714/37 |
| 72 | US 6363475 B1 | ⊠ | Apparatus and method for program level parallelism in a VLIW processor | 712/206 |
| 73 | US 6353439 B1 | ⊠ | System, method and computer program product for a blending operation in a transform module of a computer graphics pipeline | 345/561 |
| 74 | US 6349363 B1 | ⊠ | Multi-section cache with different attributes for each section | 711/129 |
| 75 | US 6342888 B1 | ⊠ | Graphics processing unit with an integrated fog and blending operation | 345/426 |
| 76 | US 6330661 B1 | ⊠ | Reducing inherited logical to physical register mapping information between tasks in multithread system using register group identifier | 712/228 |



**FIG. 32** 

|    | Document ID | U | Title | Current OR |
|----|-------------|---|-------|------------|
| 77 | US 6295600 B1 | ⊠ | Thread switch on blocked load or store using instruction thread field | 712/228 |
| 78 | US 6289446 B1 | ⊠ | Exception handling utilizing call instruction with context information | 712/244 |
| 79 | US 6216220 B1 | ⊠ | Multithreaded data processing method with long latency subinstructions | 712/219 |
| 80 | US 6198488 B1 | ⊠ | Transform, lighting and rasterization system embodied on a single semiconductor platform | 345/426 |
| 81 | US 6170051 B1 | ⊠ | Apparatus and method for program level parallelism in a VLIW processor | 712/225 |
| 82 | US 6073159 A | ⊠ | Thread properties attribute vector based thread selection in multithreading processor | 718/103 |
| 83 | US 5949994 A | ⊠ | Dedicated context-cycling computer with timed context | 712/228 |
| 84 | US 5933627 A | ⊠ | Thread switch on blocked load or store using instruction thread field | 712/228 |
| 85 | US 5854922 A |   | Micro-sequencer apparatus and method of combination state machine and instruction memory | 712/245 |
| 86 | US 5574922 A | ⊠ | Processor with sequences of processor instructions for locked memory updates | 712/220 |



**FIG. 34** (legible labels: REGISTER FILE, OPERAND COMPARATORS)

|    | Document ID | U | Title | Current OR |
|----|-------------|---|-------|------------|
| 1  | US 20040062245 A1 |   | TCP/IP offload device | 370/392 |
| 2  | US 20040055003 A1 | ⊠ | Uniprocessor operating system design facilitating fast context swtiching | 718/108 |
| 3  | US 20040042507 A1 | ⊠ | Method and apparatus for fast change of internet protocol headers compression mechanism | 370/521 |
| 4  | US 20040034708 A1 | ⊠ | Method and apparatus for fast internet protocol headers compression initialization | 709/227 |
| 5  | US 20040030873 A1 | ⊠ | Single chip multiprocessing microprocessor having synchronization register file | 712/245 |
| 6  | US 20030229683 A1 | ⊠ | Information providing system and method and storage medium | 709/219 |
| 7  | US 20030154307 A1 | ⊠ | Method and apparatus for aggregate network address routes | 709/245 |
| 8  | US 20030145155 A1 | ⊠ | Data transfer mechanism | 711/104 |
| 9  | US 20030093571 A1 | ⊠ | Information providing system and method and storage medium | 709/248 |
| 10 | US 20030074542 A1 | ⊠ | Multiprocessor system and program optimizing method | 712/20 |
| 11 | US 20030060210 A1 | ⊠ | System and method for providing real-time and non-real-time services over a communications system | 455/452.1 |
| 12 | US 20030026230 A1 | ⊠ | Proxy duplicate address detection for dynamic address allocation | 370/338 |
| 13 | US 20030014472 A1 | ⊠ | Thread ending method and device and parallel processor system | 718/107 |
| 14 | US 20030009743 A1 |   | METHOD AND APPARATUS FOR PRE-PROCESSING AND PACKAGING CLASS FILES | 717/117 |
| 15 | US 20020191612 A1 |   | Method and apparatus for automatically determining an appropriate transmission method in a network | 370/392 |
| 16 | US 20020186046 A1 | ⊠ | Circuit architecture for reduced-synchrony on-chip interconnect | 326/47 |
| 17 | US 20020181448 A1 | ⊠ | Prevention of spoofing in telecommunications systems | 370/352 |



**FIG. 35**

| FIROB-RESULT DATA          | RDATA-CTL      |
|----------------------------|----------------|
| REGISTER FILE              | REGF-CTL       |
| FIROB-DESTINATION POINTERS | GLOBAL STATUS  |
| OPERAND COMPARATORS        | FIROB-CTL      |

**FIG. 38**

|    | Document ID | U | Title | Current OR |
|----|-------------|---|-------|------------|
| 18 | US 20020169804 A1 | ⊠ | System and method for storage space optimized memorization and generation of web pages | 715/513 |
| 19 | US 20020099844 A1 | ⊠ | Load balancing and dynamic control of multiple data streams in a network | 709/232 |
| 20 | US 20020055964 A1 | ⊠ | Software controlled pre-execution in a multithreaded processor | 718/107 |
| 21 | US 20020045437 A1 | ⊠ | Tracing a location of a mobile device | 455/411 |
| 22 | US 20020023202 A1 | ⊠ | Load value queue input replication in a simultaneous and redundantly threaded processor | 712/225 |
| 23 | US 20020010710 A1 | ⊠ | Method for characterizing a complex system | 715/500 |
| 24 | US 20010052112 A1 | ⊠ | Method and apparatus for developing software | 717/100 |
| 25 | US 20010037448 A1 | ⊠ | Input replicator for interrupts in a simultaneous and redundantly threaded processor | 712/244 |
| 26 | US 20010037447 A1 | ⊠ | Simultaneous and redundantly threaded processor branch outcome queue | 712/239 |
| 27 | US 20010036255 A1 | ⊠ | Methods and apparatus for providing speech recognition services to communication system users | 379/88.01 |
| 28 | US 20010034854 A1 | ⊠ | Simultaneous and redundantly threaded processor uncached load address comparator and data value replication circuit | 714/5 |
| 29 | US 20010034827 A1 | ⊠ | Active load address buffer | 712/225 |
| 30 | US 20010034824 A1 | ⊠ | Simultaneous and redundantly threaded processor store instruction comparator | 712/215 |
| 31 | US 20010029478 A1 | ⊠ | System and method for supporting online auctions | 705/37 |
| 32 | US 20010016899 A1 | ⊠ | Data-processing device | 712/215 |
| 33 | US 20010005880 A1 |   | Information-processing device that executes general-purpose processing and transaction processing | 712/34 |
| 34 | US 6707813 B1 |   | Method of call control to minimize delays in launching multimedia or voice calls in a packet-switched radio telecommunications network | 370/356 |
| 35 | US 6667988 B1 | ⊠ | System and method for multi-level context switching in an electronic network | 370/463 |



FIG. 37



**FIG. 38**

| -  | Docum<br>ent<br>ID   | ט | Title                                                                                                                                                | Current        |
|----|----------------------|---|------------------------------------------------------------------------------------------------------------------------------------------------------|----------------|
| 36 | US<br>65981<br>51 B1 | ⊠ | Stack Pointer Management                                                                                                                             | 712/228        |
| 37 | US<br>65981<br>22 B2 | ⋈ | Active load address buffer                                                                                                                           | 711/126        |
| 38 | US<br>65565<br>63 B1 | ⊠ | Intelligent voice bridging                                                                                                                           | 370/352        |
| 39 | US<br>65325<br>54 B1 | ⊠ | Network event correlation system using formally specified models of protocol behavior                                                                | 714/43         |
| 40 | US<br>65300<br>80 B2 | ☒ | Method and apparatus for pre-processing and packaging class files                                                                                    | 717/166        |
| 41 | US<br>65075<br>92 B1 | Ø | Apparatus and a method for two-way data communication                                                                                                | 370/503        |
| 42 | US<br>64700<br>81 B1 | Ø | Telecommunications resource connection and operation using a service control point                                                                   | 379/221.09     |
| 43 | US<br>64184<br>60 B1 | Ø | System and method for finding preempted threads in a multi-threaded application                                                                      | 718/108        |
| 44 | US<br>64164<br>95 B1 | ☒ | Implantable fluid delivery device for basal and bolus delivery of medicinal fluids                                                                   | 604/132        |
| 45 | US<br>63781<br>25 B1 | ⊠ | Debugger thread identification points                                                                                                                | 717/129        |
| 46 | US<br>63380<br>78 B1 | Ø | System and method for sequencing packets for multiprocessor parallelization in a computer network system                                             | 718/102        |
| 47 | US<br>62984<br>11 B1 | Ø | Method and apparatus to share instruction images in a virtual cache                                                                                  | 711/3          |
| 48 | US<br>62333<br>15 B1 | Ø | Methods and apparatus for increasing the utility and interoperability of peripheral devices in communications systems                                | 379/88.01      |
| 49 | US<br>62298<br>80 B1 | ⊠ | Methods and apparatus for efficiently providing a communication system with speech recognition capabilities                                          | 379/88.01      |
| 50 | US<br>62255<br>66 B1 | ⊠ | Self-retaining screw spacer arrangement                                                                                                              | 174/138E       |
| 51 | US<br>61697<br>45 B1 | × | System and method for multi-level context switching in an electronic network                                                                         | 370/463        |
| 52 | US<br>61547<br>77 A  | ⋈ | System for context-dependent name resolution                                                                                                         | 709/227        |
| 53 | US<br>60785<br>64 A  | ⊠ | System for improving data throughput of a TCP/IP network connection with slow return channel                                                         | 370/235        |
| 54 | US<br>59667<br>02 A  | ⊠ | Method and apparatus for pre-processing and packaging class files                                                                                    | 707/1          |
| 55 | US<br>58729<br>63 A  | ⊠ | Resumption of preempted non-privileged threads with no kernel intervention                                                                           | 712/233        |
| 56 | US<br>58620<br>50 A  | ⊠ | System for preparing production process flow                                                                                                         | 700/97         |
| 57 | US<br>58127<br>60 A  | ⊠ | Programmable byte wise MPEG systems layer parser                                                                                                     | 714/49         |
| 58 | US<br>57614<br>92 A  | Ø | Method and apparatus for uniform and efficient handling of multiple precise events in a processor by including event commands in the instruction set | 712/244        |


|    | Document<br>ID      | U | Title                                                                                      | Current<br>OR |
|----|---------------------|---|--------------------------------------------------------------------------------------------|---------------|
| 59 | US<br>53575<br>08 A | Ø | Connectionless ATM network support using partial connections                               | 370/397       |
| 60 | US<br>52010<br>39 A |   | Multiple address-space data processor with addressable register and context switching      | 711/201       |
| 61 | US<br>50086<br>58 A | Ø | Domed light housing for back-lit LCD display                                               | 345/87        |
| 62 | US<br>47457<br>13 A |   | Prefabricated PC shelter structure                                                         | 52/89         |
| 63 | US<br>46316<br>05 A |   | Multiple speed scanner servo system for protecting the heads and tape of helical recorders | 360/70        |


|    | Document<br>ID              | U | Title                                                                                                                                                                                                                            | Current<br>OR |  |  |
|----|-----------------------------|---|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------|--|--|
| 1  | JP<br>10152<br>810 A        |   | CONNECTING CABLE                                                                                                                                                                                                                 |         |  |  |
| 2  | JP<br>09137<br>424 A        | ☒ | INSTALLATION METHOD OF PC GIRDER                                                                                                                                                                                                 |         |  |  |
| 3  | JP<br>08249<br>183 A        | × | EXECUTION FOR INFERENCE PARALLEL INSTRUCTION THREADS                                                                                                                                                                             |         |  |  |
| 4  | JP<br>03172<br>416 A        | Ø | SHEATHING WALL STRUCTURE                                                                                                                                                                                                         |         |  |  |
| 5  | JP<br>02030<br>839 A        | ⊠ | ANTICORROSIVE CONSTRUCTION IN ANCHOR SECTION OF PC STRAND                                                                                                                                                                        |         |  |  |
| 6  | WO<br>31076<br>16 A1        | × | METHOD AND APPARATUS FOR INTERNET PROTOCOL HEADERS COMPRESSION INITIALIZATION                                                                                                                                                    |         |  |  |
| 7  | WO<br>31071<br>07 A1        | ⊠ | CONTROLLING AND/OR MONITORING DEVICE USING AT LEAST A TRANSMISSION CONTROLLER                                                                                                                                                    |         |  |  |
| 8  | WO<br>99315<br>80 A1        | ⊠ | PROCESSOR HAVING MULTIPLE PROGRAM COUNTERS AND TRACE BUFFERS OUTSIDE AN EXECUTION PIPELINE                                                                                                                                       |         |  |  |
| 9  | NNRD4<br>2676               | ☒ | Multiple Terminal Simulation Tool Utilizing Multiple Separate IP Addresses                                                                                                                                                       |         |  |  |
| 10 | NNRD4<br>09126              | Ø | Multiple Inline Multi-Function Tabs for Slider Controls                                                                                                                                                                          |         |  |  |
| 11 | NN910<br>5322               | Ø | Determining Speaker-Dependent Phonetic Baseforms.                                                                                                                                                                                |         |  |  |
| 12 | NN880<br>5308               | ☒ | Card on Board Support System                                                                                                                                                                                                     |         |  |  |
| 13 | WO<br>20031<br>07107<br>A   | ⊠ | Control and monitoring device for an automation system or similar, comprises peripherals linked to a central control unit that has two transmission controllers for controlling communications and setting up a security circuit |         |  |  |
| 14 | JP<br>20032<br>68977<br>A   | ⊠ | Connection tension device for connecting precast concrete steel materials, uses embedded bearing plate to support reaction force of connection cylinder which tightens steel materials                                           |         |  |  |
| 15 | US<br>20030<br>00526<br>2 A | ⊠ | Multithreaded processor for providing high instruction fetch bandwidth, has instruction buffer and temporary instruction cache to respectively receive different blocks of cache line                                            |         |  |  |
| 16 | JP<br>20021<br>41931<br>A   | ⊠ | Router device for internet, has context ID rewriting unit which modifies similar context ID owned by several packets using unique number assigned to each address                                                                |         |  |  |
| 17 | WO<br>20011<br>6718<br>A    | × | Micro controlled function execution unit has controller maintaining multiple program counters and having logic for decoding instructions and context event arbiter to determine executable threads                               |         |  |  |
| 18 | KR<br>20010<br>02485<br>A   | ⊠ | Multi-thread microprocessor for instruction fetch                                                                                                                                                                                |         |  |  |
| 19 | US<br>61822<br>10 B         | × | Processor with multiple program counters and trace buffers outside execution pipeline                                                                                                                                            |         |  |  |
| 20 | JP<br>08194<br>612 A        | ⊠ | Program execution control for computer system - by calling second program counter according to analysis result of control data acquired after first program counter calls second program counter from control data gp.           |         |  |  |
| 21 | EP<br>52802<br>4 B          | ☒ | Error reporting for translated code execution - using address<br>correlation table to identify source information in original<br>code from error address in translated code                                                      |         |  |  |
| 22 | GB<br>20967<br>19 A         | Ø | Clutch arrangement for marine propulsion drive - includes<br>engageable pawl detent causing screwing of bush to engage<br>intermediate clutch part, when drive side speed drops                                                  |         |






|    | Document<br>ID            | U           | Title                                                                                                                                                                     | Current<br>OR                         |
|----|---------------------------|-------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------|
| 1  | JP<br>20013<br>06263<br>A |             | MEDIA DATA STORAGE DEVICE                                                                                                                                                 |                                       |
| 2  | JP<br>20011<br>42937<br>A | ×           | SCHEDULING CORRECTNESS CHECKING METHOD AND SCHEDULE VERIFYING<br>METHOD FOR CIRCUIT                                                                                       |                                       |
| 3  | JP<br>20010<br>46747<br>A | ☒           | PROCESSING CONTROL METHOD FOR VIDEO GAME, RECORDING MEDIUM RECORDING PROCESS CONTROL PROGRAM, AND GAME DEVICE                                                             |                                       |
| 4  | JP<br>20003<br>11253<br>A | ☒           | THREE-DIMENSIONAL IMAGE GENERATION SYSTEM AND METHOD AND RECORDING MEDIUM                                                                                                 |                                       |
| 5  | JP<br>11296<br>385 A      | ☒           | CONTEXT CONTROLLER FOR MANAGING MULTI-TASKING BY PROCESSOR                                                                                                                |                                       |
| 6  | JP<br>11272<br>484 A      | ⊠           | DEVICE AND METHOD FOR DISPATCHING CLIENT REQUEST                                                                                                                          |                                       |
| 7  | JP<br>11023<br>420 A      |             | METHOD FOR EVALUATING LIFE OF BOLT                                                                                                                                        |                                       |
| 8  | JP<br>09026<br>923 A      |             | METHOD AND DEVICE FOR MANAGING OBJECT AND PROCESS INSIDE<br>DISTRIBUTED OBJECT OPERATING ENVIRONMENT                                                                      |                                       |
| 9  | JP<br>08153<br>023 A      | Ø           | INSTRUCTION EXECUTION FREQUENCY MEASURING METHOD AND MULTIPROCESSOR SYSTEM USING THE METHOD                                                                               |                                       |
| 10 | JP<br>07000<br>667 A      | ☒           | THREAD CUT CONTROL DEVICE FOR SEWING MACHINE                                                                                                                              |                                       |
| 11 | JP<br>02177<br>995 A      | ⊠           | BUTTON HOLDING MECHANISM FOR BUTTON SEWING MACHINE                                                                                                                        |                                       |
| 12 | JP<br>01292<br>430 A      | ⊠           | PARALLEL PROCESSING PROCESSOR                                                                                                                                             |                                       |
| 13 | JP<br>01274<br>240 A      | ☒           | PARALLEL PROCESSING PROCESSOR                                                                                                                                             |                                       |
| 14 | WO<br>22529<br>A1         | ⊠           | METHOD FOR PROCESSING AN ELECTRONIC SYSTEM SUBJECTED TO<br>TRANSIENT ERROR CONSTRAINTS AND MEMORY ACCESS MONITORING<br>DEVICE                                             |                                       |
| 15 | EP<br>94236<br>6 A2       | ☒           | Event-driven and cyclic context controller and processor employing the same                                                                                               |                                       |
| 16 | EP<br>94236<br>5 A2       |             | Context controller having instruction-based time slice task switching capability and processor employing the same                                                         |                                       |
| 17 | EP<br>94236<br>4 A2       |             | Context controller having status-based background task resource allocation capability and processor employing the same                                                    |                                       |
| 18 | EP<br>73547<br>5 A2       |             | Method and apparatus for managing objects in a distributed object operating environment                                                                                   |                                       |
| 19 | DE<br>43318<br>67 A1      |             | Method and winding device for producing wound objects, especially fruit gum whirls                                                                                        |                                       |
| 20 | DE<br>43149<br>82 A1      | ☒           | Method for making a thread connection by splicing                                                                                                                         |                                       |
| 21 | EP<br>61736<br>1 A2       | ⊠           | Scheduling method and apparatus for a communication network.                                                                                                              |                                       |





|    | Document<br>ID              | U | Title                                                                                                                                                                                                                     | Current<br>OR |  |  |  |
|----|-----------------------------|---|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|--|--|--|
| 22 | EP<br>56742<br>8 A1         | × | Process for starting a loom and loom for effecting the same.                                                                                                                                                              |               |  |  |  |
| 23 | GB<br>22346<br>13 A         | × | Method and apparatus for switching contexts in a microprocessor                                                                                                                                                           |               |  |  |  |
| 24 | DE<br>34330<br>27 A1        | ☒ | Lockstitch sewing machine                                                                                                                                                                                                 |               |  |  |  |
| 25 | EP<br>12603<br>5 A1         | ⊠ | Device for producing knitted pockets with the aid of a circular machine for products manufactured in a tubular way with continuous unidirectional movement, and machine equipped for this device.                         |               |  |  |  |
| 26 | NN921<br>1392               | ☒ | User Interface Architecture for Response Time Dependent                                                                                                                                                                   |               |  |  |  |
| 27 | US<br>20040<br>00302<br>3 A | × | Computer system processor selection method for loading processing thread, involves overwriting candidate/volunteer processor information based on comparison of load between candidate/volunteer processor                |               |  |  |  |
| 28 | US<br>66402<br>99 B         | ⊠ | Computation engine access arbitrating method for video graphics controller, involves determining priority operation instruction based on application specification prioritization scheme when multiple codes are pending  |               |  |  |  |
| 29 | US<br>20030<br>09754<br>8 A | ⊠ | Processing system e.g. pipelined computer processing system, selects address and instructions from respective context registers simultaneously for each cycle of execution of processor unit                              |               |  |  |  |
| 30 | US<br>20030<br>04651<br>7 A | ⊠ | Multithreading apparatus for computer processor pipeline, has control unit statically scheduled to execute multiple threads in round robin succession to eliminate need for communication between pipeline stages         |               |  |  |  |
| 31 | US<br>20020<br>10399<br>0 A | ⊠ | Multithreaded processor architecture e.g. for precession computer, includes cycle allocation table containing thread identifiers for active threads, and different execution time allotted to each thread                 |               |  |  |  |
| 32 | US<br>20020<br>06934<br>5 A | ⊠ | Very long instruction word processor includes threads comprising processing units which execute respective issue groups of VLIW packets in single clock cycle                                                             |               |  |  |  |
| 33 | US<br>20020<br>00266<br>7 A | ⊠ | Embedded processor architecture for enabling multithreading, invokes zero-time context switches between end and beginning of program instruction execution states in threads                                              |               |  |  |  |
| 34 | JP<br>20011<br>42937<br>A   | ⊠ | Scheduling correctness checking method for circuit, involves executing symbolic simulation for extracting loop invariant term for determining sufficient set of non-cyclic thread                                         |               |  |  |  |
| 35 | US<br>63413<br>47 B         | ⊠ | Multiple thread processor has thread switch logic for switching execution threads according to thread switching mode selected from multiple thread switching modes                                                        |               |  |  |  |
| 36 | WO<br>20005<br>2244<br>A    | ⊠ | Production of tubular knitwear such as bras, panties, involves producing specific lengths of tubular fabric having cylindrical shape by excluding predetermined number of needles of needlebed                            |               |  |  |  |
| 37 | US<br>64668<br>98 B         | ⊠ | Event driven logic simulation executing method for design and debug of VLSI circuit, involves creating master and slave threads for execution of logic simulation algorithm on processor platform                         |               |  |  |  |
| 38 | US<br>20020<br>09257<br>9 A | ⊠ | Weaving machine for kelim and gobelin fabrics                                                                                                                                                                             |               |  |  |  |
| 39 | BE<br>10112<br>62 A         | ⊠ | Double plush carpet weaving method - involves interlacing pile threads with two base cloths in at least three different patterns to form double-plush and flat weave regions                                              |               |  |  |  |
| 40 | KR<br>98063<br>489 A        | ⊠ | Multithread data processing system for RISC architecture, has<br>storage control unit for extracting requested data and<br>instruction for execution and instruction units based on data<br>and instruction fetch request |               |  |  |  |






|    | Docum<br>ent<br>ID   | σ | Title                                                                                                                                                                                                                                                                                                                                                           | Current<br>OR |
|----|----------------------|---|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|
| 41 | EP<br>89874<br>3 B   | ⊠ | Multi-thread microprocessor for executing interrupt service routines as thread - includes context file which stores multiple contexts each associated with thread, such that multiple threads may be concurrently executed                                                                                                                                      |               |
| 42 | EP<br>72403<br>2 A   | ⊠ | Weaving machine solenoid controlled thread selecting device - includes thread selecting units each comprising hook pivoting between selection needle engaging and release positions according to selective energised state of solenoid                                                                                                                          |               |
| 43 | US<br>54902<br>72 A  | ⊠ | Creating multi-threaded processing cycle time slices for application program running on 32-bit desk-top multi-tasking operating system - obtaining 1st time slice for thread having fixed execution time, loading threadlet control block and executing threadlets for threadlet control block in order specified and within fixed execution time of time slice |               |
| 44 | EP<br>57094<br>7 A   | ⊠ | Jacquard pulley tackle - has structured layout of two rollers and roller unit with cord hooks to reduce assembly height                                                                                                                                                                                                                                         |               |
| 45 | EP<br>53601<br>0 A   | Ø | Management in real=time of multi-function processor - using time slicing to move between processes with context saving at end of time slice to allow fast context restoration                                                                                                                                                                                   |               |
| 46 | DE<br>40205<br>50 C  | ☒ | Warp knitting machine with needle bed with guide bars - provides knitted goods with stable surface                                                                                                                                                                                                                                                              |               |
| 47 | GB<br>22346<br>13 A  | ⊠ | Fast microprocessor, switching contexts when cache miss occurs - couples copies of state elements to multiplexer to save context of instructions and resume execution in one cycle                                                                                                                                                                              |               |
| 48 | EP<br>38347<br>4 A   | Ø | Programmed peripheral controller - uses cyclic reassessment of existing and pending contexts and priorities to select correct context to execute                                                                                                                                                                                                                |               |
| 49 | EP<br>36460<br>0 B   | ☒ | Round screw machining method - executing repetition of screw threading cycles obtained by NC tool data, while shifting sequentially cycle start point along arc shape                                                                                                                                                                                           |               |
| 50 | EP<br>29547<br>0 A   | ☒ | Call transfer procedure in computer controlled exchange - first sending unanswered ringing current to selected group of phones before extending to all                                                                                                                                                                                                          |               |
| 51 | JP<br>63275<br>389 A | × | Appts. to detect remaining amt. of bobbin thread in sewing machine - contg. bobbin thread remaining amt. counters for different types of bobbins, and selector                                                                                                                                                                                                  |               |
| 52 | JP<br>86042<br>590 B | ⊠ | Control unit for automatic sewing machine - in which mode is selected for next cycle when abnormal condition is detected (J5 6.5.82)                                                                                                                                                                                                                            |               |
| 53 | DE<br>35866<br>03 G  | ⊠ | Threaded interpretive language data processor - effects multiple data selection and arithmetic operations in single clock cycle using parameter stack implemented in hardware                                                                                                                                                                                   |               |
| 54 | EP<br>12603<br>5 A   | ☒ | Tubular knitting machine with rotary needle selection device - for producing hose with integral heels, ends, gussets etc.                                                                                                                                                                                                                                       |               |
| 55 | BE<br>90007<br>5 A   | ⊠ | Weft yarn spool drives for shuttle free loom - with stepping drive electric motors under programmed pulse control with angular references                                                                                                                                                                                                                       |               |
| 56 | EP<br>11415<br>3 A   | ⊠ | Circular profile knitting machine with variable thread feed rate - to eliminate residual tension in yarns forming stocking heels or other profile variations                                                                                                                                                                                                    |               |
| 57 | SU<br>10965<br>29 A  | ⊠ | Cycling bend and twist test for flexible components - with reciprocating bending surfaces linked with torsion oscillation of end grip                                                                                                                                                                                                                           |               |
| 58 | EP<br>80581<br>A     | ☒ | Loom control system - applies additional warp tension at re-start, varying according to loom angle sensed by detector                                                                                                                                                                                                                                           |               |
| 59 | BE<br>89338<br>9 A   | ☒ | Weft supply elements, selected by electromagnets - move from wait to transfer positions against spring restraint, for shed insertion                                                                                                                                                                                                                            |               |
| 60 | SU<br>50412<br>8 A   | ☒ | Textile threads flexure destructive testing - uses complex waveform vibrator and two clamps with fixing jaws                                                                                                                                                                                                                                                    |               |
| 61 | DE<br>19215<br>76 A  | Ø | Actuating weft threads during colour change in - shuttless<br>weaving looms                                                                                                                                                                                                                                                                                     |               |

Sheet 34 of 45





|   | L # | Hits  | Search Text                                                                                                      | DBs                |
|---|-----|-------|------------------------------------------------------------------------------------------------------------------|--------------------|
| 1 | L1  | 683   | <pre>(select\$3 interleav\$3 execut\$3) near10 ((cycl\$6 robin) near20 (multithread\$3 thread\$3 context))</pre> | USPAT;<br>US-PGPUB |
| 2 | L2  | 15106 | <pre>select\$3 near10 (pc pa ip ((instruction program) adj2 (counter address)))</pre>                            | USPAT;<br>US-PGPUB |
| 3 | L4  | 58    | 2 near99 (multithread\$3 thread\$3 context) and pipelin\$3                                                       | USPAT;<br>US-PGPUB |
| 4 | L5  | 327   | 1 and pipelin\$3                                                                                                 | USPAT;<br>US-PGPUB |
| 5 | L6  | 189   | (multithread\$3 thread\$3 context).ab,ti. and 5                                                                  | USPAT;<br>US-PGPUB |

RETPRED-Output to Idecode indicates the current prediction of the return instruction of the fetched line. The return instruction must be detected in the current line of instruction or the Icache must be re-fetched from a new line.

IC\_EXT\_RD-Output to CMASTER indicates the next line of instruction should be fetched from external regardless of aliasing. This is for pre-fetching of instruction which crosses the line boundary.

MMUPFPGFLT-Input from MMU indicates page fault for the current instruction address.

TLB\_MISS\_PF-Input from MMU indicates TLB miss for the current instruction address.

PF\_IC\_XFER-Output to CMASTER indicates the address for the current line is written into the cache, the L2 should be updated with the physical address. This is when the ICPDAT and the valid bit is written.

BIU\_NC-Input from BIU indicates the current line should not be cached.

LS2ICNOIC-Input from LSSEC indicates no caching, pre-fetch only.

LS\_CS\_WR-Input from LSSEC indicates the CS is being updated.

L2\_IC\_INV(1:0)-Input from CMASTER to invalidate up to 2 lines in the Icache.

PF\_IDX(6:0)-Input from CMASTER indicates the array index for invalidating up to 2 lines in the Icache or for aliasing.

PF\_SNP\_COL(2:0)-Input from CMASTER indicates the way associative for invalidating up to 2 lines in the Icache or aliasing. This signal may be redundant with PFREPLCOL(2:0).

BIT20MASK-Input from CMASTER indicates masking of bit 20 for backward compatible with 8086. The line should not be cached if outside of the page.

BSTRUN-Input from TAP indicates to start the BIST.

BSTRD-Input from TAP indicates to read the array and compare to set the result.

BSTWR-Input from TAP indicates to write the array from input registers.

BSTRST-Input from TAP indicates to reset the counter.

BSTINCR-Input from TAP indicates to increment the counter.

BSTDIN-Input from TAP indicates the test pattern to the input registers. The input can be from the TDI pin or normal burn-in patterns.

FLUSHON-Input from TAP indicates flushing register mode, the result latch should use BSTDIN instead of the compare input for flushing the result registers.

UPDOWN-Input from TAP indicates counting up or down.

BSTSHF1-Input from TAP indicates shifting of the master latch of registers.

BSTSHF2-Input from TAP indicates shifting of the slave latch of registers.

BSTFALSE-Input from TAP indicates to invert the test pattern.

PORTSEL-Input from TAP indicates to select the second dual port.

BSTIDOUT-Output to TAP indicates the result of the data chain from the ICSTORE and ICPDAT arrays.

BSTITOUT-Output to TAP indicates the result of the data chain from the ICNXTBLK and ICTAGV arrays.

BSTAMSB-Output to TAP indicates maximum count for dual port arrays.

MAXADDR-Output to TAP indicates maximum index counter.

ATPGIN(15:14)-Input from dedicated pins for ATPG.

ATPGOUT(15:14)-Output to dedicated pins for ATPG.

Processor 500 executes fast X86 instructions directly, no ROPs are needed. The pre-decode bits with each byte of instruction are 3 bits; start bit, end bit, and functional bit. All the externally fetched instructions will be latched into the Icache. This should not be a problem since the Icache is idle and waits for external instructions. Only single byte prefix of 0x66 and 0x0F is allowed for Processor 500's fast path, multiple prefixes including 0x67 is allowed for multi-prefix, all other prefixes will take an extra cycle in decoding or go to MROM. With these simple prefixes, the instruction bytes need not be modified. The linear valid bit is used for the whole cache-line of instructions, 16-byte. The replacement procedure is done by the CMASTER. Along with each line of instruction, the CMASTER tells the Icache which way to put in the data and tag. The start and end bits are sufficient to validate the instruction. If branching to the middle of the line or instructions wrapping to the next line, the start and end bits must be detected for each instruction or else the instruction must be pre-decoded again. The possible cases are branching to the opcode and skipping the prefix (punning of instruction) and part of the wrapping instruction is replaced in the Icache. The instructions must first pass through the pre-fetch buffers before sending to the ICPRED. The ICPRED has only one input from the IB(127:0) for both the pre-fetched or cached instructions. The pre-decode information is written into the ICPDAT as the whole line is decoded. The output IB(127:0) is merged with the previous 8-byte to form a 24-byte line for the alignment unit to select and send to 4 decode units.

Since the instruction fetching from external memory will be written directly into the Icache, the pre-fetch buffer should be built into the ICSTORE; the input/output path of the array. In this way, the data will be written into the Icache regardless of the pre-decode information or the taken branch instruction and the instructions is available to the Icache as soon as they are valid on the bus. The number of pre-fetch buffers is two, and request will be made to BIU as soon as there is space in the pre-fetch buffer for another line of instructions. The pre-fetch buffer consists of a counter and a valid bit for instructions written into the cache and a valid bit for instructions sent to the decode unit. As long as the address pointer is still in the same block, the data will be written to the array. With the pre-fetch buffer in the Icache, a dedicated bus should be used to transfer instructions directly from the pads to the Icache; this is a step to keep Processor 500 from using dynamic pre-charged buses.

## ICSTORE ORGANIZATION

The ICSTORE on Processor 500 does not store the pre-decode data, as shown in FIG. 9. The ICSTORE consists of 32K bytes of instructions organized as 8 sets of 128 rows by 256 columns. The array set in this documentation has its own decoder. The decoder is in the center of the set. Each of the sets consist of 2-byte of instructions. The 8-way associative muxing from the 8 TAG-HITs is performed before the data is routed to the ICALIGN block. With this arrangement, the input/output to each set is 16-bit buses. The muxing information relating to which byte is going to which decode unit should also be decoded; this topic will be discussed in detail below in the ICALIGN block section. For optimal performance the layout of the column should be 64 RAM cells, pre-charge, 64 RAM cells, write buffer and senamp. The row decoder should be in the middle of the array to drive 128 column each way. Basically, the pre-charge and the row decoder should be crossed in the middle of the array. The self-time column is used to generate internal clock for each set of the array. Pre-charge is gated
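The ICSTORE figures quoted above can be cross-checked with a little arithmetic. The sketch below is only a consistency check, and it assumes that each "column" holds one bit; the constant names are illustrative and do not come from the specification.

```python
# Consistency check of the ICSTORE geometry quoted in the text.
# Assumption (not stated in the source): one column stores one bit.
SETS = 8                 # "8 sets"
ROWS_PER_SET = 128       # "128 rows"
COLUMNS_PER_ROW = 256    # "256 columns", taken here as 256 bits
BUS_BITS_PER_SET = 16    # "the input/output to each set is 16-bit buses"

total_bits = SETS * ROWS_PER_SET * COLUMNS_PER_ROW
total_bytes = total_bits // 8

# Combined width of the eight per-set buses, for comparison with IB(127:0).
combined_bus_bits = SETS * BUS_BITS_PER_SET

# Number of index bits needed to select one of 128 rows.
row_index_bits = (ROWS_PER_SET - 1).bit_length()

print(total_bytes, combined_bus_bits, row_index_bits)
```

Under that assumption the array holds 32768 bytes, which agrees with the stated 32K bytes, and the eight 16-bit buses together are as wide as the 128-bit IB(127:0) bus. The 7 row-index bits happen to match the width of PF\_IDX(6:0), although the text does not say that this signal indexes the ICSTORE rows.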

|    | Document ID       | U | Title                                                                                                        | Current OR |
|----|-------------------|---|--------------------------------------------------------------------------------------------------------------|------------|
| 1  | US 20040073910 A1 |   | Method and apparatus for high speed cross-thread interrupts in a multithreaded processor                     | 719/310    |
| 2  | US 20040073905 A1 | ⊠ | Method and apparatus to quiesce a portion of a simultaneous multithreaded central processing unit            | 718/101    |
| 3  | US 20040073781 A1 | ☒ | Method and apparatus for token triggered multithreading                                                      | 712/235    |
| 4  | US 20040073779 A1 | ⊠ | Method and apparatus for register file port reduction in a multithreaded processor                           | 712/225    |
| 5  | US 20040073778 A1 | ☒ | Parallel processor architecture                                                                              | 712/220    |
| 6  | US 20040073772 A1 | ⊠ | Method and apparatus for thread-based memory access in a multithreaded processor                             | 712/1      |
| 7  | US 20040070526 A1 |   | ARITHMETIC DECODING METHOD AND AN ARITHMETIC DECODING APPARATUS                                              | 341/107    |
| 8  | US 20040054990 A1 |   | Post-pass binary adaptation for software-based speculative precomputation                                    | 717/124    |
| 9  | US 20040054880 A1 | ☒ | Microengine for parallel processor architecture                                                              | 712/245    |
| 10 | US 20040034858 A1 | ☒ | Programming a multi-threaded processor                                                                       | 718/108    |
| 11 | US 20040034759 A1 | ⊠ | Multi-threaded pipeline with context issue rules                                                             | 712/1      |
| 12 | US 20040006584 A1 |   | Array of parallel programmable processing engines and deterministic method of operating the same             | 718/107    |
| 13 | US 20030233394 A1 | ⊠ | Method and apparatus for ensuring fairness and forward progress when executing multiple threads of execution | 718/107    |
| 14 | US 20030225975 A1 | ⊠ | Method and apparatus for multithreaded cache with cache eviction based on thread identifier                  | 711/133    |
| 15 | US 20030212881 A1 |   | Method and apparatus to enhance performance in a multi-threaded microprocessor with predication              | 712/226    |
| 16 | US 20030200424 A1 | ⊠ | Master-slave latch circuit for multithreaded processing                                                      | 712/228    |
| 17 | US 20030191927 A1 | ⊠ | Multiple-thread processor with in-pipeline, thread selectable storage                                        | 712/228    |

SUPERSCALAR MICROPROCESSOR CONFIGURED TO PREDICT RETURN ADDRESSES FROM A RETURN STACK STORAGE

## BACKGROUND OF THE INVENTION

1. Field of the Invention

This invention is related to the field of microprocessors and, more particularly, to speculative return address prediction mechanisms for predicting the address of a return instruction within microprocessors.

2. Description of the Relevant Art

Superscalar microprocessors achieve high performance by executing multiple instructions concurrently and by choosing the shortest possible clock cycle consistent with the design. As used herein, the term "clock cycle" refers to an interval of time in which the various stages of the instruction processing pipelines complete their tasks. Instructions and computed values are captured by memory elements (such as registers or arrays) according to a clock signal defining the clock cycle. For example, a memory element may capture a value according to the rising or falling edge of the clock signal.

Many superscalar microprocessor manufacturers design their microprocessors in accordance with the x86 microprocessor architecture. The x86 microprocessor architecture is widely accepted in the computer industry, and therefore a large body of software exists which runs only on microprocessors embodying this architecture. Microprocessors designed in accordance with the x86 architecture advantageously retain compatibility with this body of software. As will be appreciated by those skilled in the art, the x86 architecture includes a "stack" area in memory. The stack is useful for passing information between a program and a subroutine called by that program, among other things. A subroutine performs a function that a program requires, and then returns to the instruction following the call to the subroutine. Therefore, a subroutine may be called from multiple places within a program to perform its function. A "subroutine call instruction" or, more briefly, a "call instruction" is an instruction used to call a subroutine. The address of the instruction subsequent to the call instruction is saved in a storage location. The address of the instruction subsequent to the call instruction is referred to as the "return address". The instruction which causes program execution to resume at the return address is a "return instruction".

In the x86 architecture, the ESP register points to the address in memory which currently forms the top of the stack. A stack structure is a Last-In, First-Out (LIFO) structure in which values are placed on the stack in a certain order and are removed from the stack in the reverse order. Therefore, the top of the stack contains the last item placed on the stack. The action of placing a value on the stack is known as a "push", and requesting that a push be performed is a "push command". The action of removing a value from the stack is referred to as a "pop", and requesting that a pop be performed is a "pop command". When a push command is performed, the ESP register is decremented by the size (in bytes) of the value specified by the push command. The value is then stored at the address pointed to by the decremented ESP register value. When a pop command is performed, a number of bytes specified by the pop command are copied from the top of the stack to a destination specified by the pop command, and then the ESP register is incremented by the number of bytes.

An example of the use of push and pop commands in the x86 microprocessor architecture are the subroutine call and return instructions, as mentioned above. In the x86 microprocessor architecture, a typical subroutine call involves pushing the operands for the subroutine onto the stack, then pushing the return address onto the stack. The subroutine is called and executes, accessing any operands it may need by indexing into the stack. After completing execution, the subroutine pops the next instruction address from the top of the stack and causes that address to be fetched by the microprocessor.

The x86 microprocessor architecture, similar to other microprocessor architectures, contains branch instructions. A branch instruction is an instruction which causes the next instruction to be fetched from one of at least two possible addresses. One address is the address immediately following the branch instruction. This address is referred to as the "next sequential address". The second address is specified by the branch instruction, and is referred to as the "branch target address" or simply the "target address". Branch instructions typically select between the target address and the next sequential address based on a particular condition flag which is set by a previously executed instruction.

Since the next instruction to be executed after the branch instruction is not known until the branch instruction executes, superscalar microprocessors either stall instruction fetching until the branch instruction executes (reducing performance) or predict which address the branch instruction will select when executed. When the prediction method is chosen, the resulting superscalar microprocessor may speculatively fetch and execute instructions residing at the predicted address. If the prediction is incorrect, or "mispredicted" (as determined when the branch instruction executes), then the instructions following the branch instruction are discarded from the instruction processing pipeline and the correct instructions are fetched. Branch instructions are typically predicted when that branch instruction is decoded or when instructions are fetched, depending on the branch prediction scheme and the configuration of the microprocessor. Subroutine call and return instructions may be considered to be branches which always select the target address.

A particularly difficult type of branch instruction to predict in the x86 microprocessor architecture is the RET instruction (the return instruction defined for the x86 microprocessor architecture). The return instruction is a pop command, as described above. This type of branch instruction is difficult to predict because the target address (or return address) is not readily available when the instruction is decoded, unlike some other branch instructions. Instead, the return address is stored on the stack in a location that will be indicated by the value in the ESP register when the return instruction is executed. The value of the ESP register at the time the return instruction is decoded and the value of the ESP register at the time the return instruction is executed may differ. For similar reasons, the return address may be difficult to predict in other microprocessor architectures as well.

Return address prediction is further complicated by the use of "fake return" instructions. Return instructions are normally used in conjunction with the CALL instruction. The CALL instruction is another special type of branch instruction which causes the address of the instruction immediately following the call to be pushed onto the stack, and then instructions are fetched from an address specified by the CALL instruction (i.e. the CALL instruction is the subroutine call instruction defined for the x86 microprocessor architecture). The CALL instruction is therefore a push command, as described above, as well as a branch instruc-
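The push, pop, call and return behavior described in the background can be illustrated with a small model. This is a minimal sketch under stated assumptions, not an x86 emulator: the class `StackModel`, its method names, the 4-byte operand size, and all addresses are hypothetical choices made for the example.

```python
class StackModel:
    """Toy model of the stack rules described in the text (illustrative only)."""

    def __init__(self, esp):
        self.esp = esp      # address that currently forms the top of the stack
        self.eip = 0        # address of the instruction being executed
        self.memory = {}    # sparse memory: address -> stored value

    def push(self, value, size=4):
        # Push: decrement ESP by the value's size, then store at the new ESP.
        self.esp -= size
        self.memory[self.esp] = value

    def pop(self, size=4):
        # Pop: read from the top of the stack, then increment ESP by the size.
        value = self.memory[self.esp]
        self.esp += size
        return value

    def call(self, target, call_length):
        # CALL: push the address following the call, then branch to the target.
        self.push(self.eip + call_length)
        self.eip = target

    def ret(self):
        # RET: pop the return address and resume fetching there.
        self.eip = self.pop()


cpu = StackModel(esp=0x1000)
cpu.eip = 0x400
cpu.push(42)                            # operand for the subroutine
cpu.call(target=0x800, call_length=5)   # assumed 5-byte call instruction
operand = cpu.memory[cpu.esp + 4]       # subroutine indexes into the stack
cpu.ret()
print(hex(cpu.eip), hex(cpu.esp), operand)
```

In this model the return target is only known once `ret` reads memory at the current ESP, which is the property that makes the return instruction hard to predict at decode time.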

|    | Document ID       | U | Title                                                                                                               | Current OR |
|----|-------------------|---|---------------------------------------------------------------------------------------------------------------------|------------|
| 18 | US 20030191866 A1 | ⊠ | Registers for data transfers                                                                                        | 719/313    |
| 19 | US 20030188141 A1 | ⊠ | Time-multiplexed speculative multi-threading to support single-threaded applications                                | 712/235    |
| 20 | US 20030182346 A1 | ⊠ | Method and apparatus for configuring arbitrary sized data paths comprising multiple context processing elements     | 708/700    |
| 21 | US 20030172256 A1 | ⊠ | Use sense urgency to continue with other heuristics to determine switch events in a temporal multithreaded CPU      | 712/228    |
| 22 | US 20030163675 A1 | ⊠ | Context switching system for a multi-thread execution pipeline loop and method of operation thereof                 | 712/228    |
| 23 | US 20030163669 A1 |   | Configuration of multi-cluster processor from single wide thread to two half-width threads                          |            |
| 24 | US 20030163668 A1 |   | Local control of multiple context processing elements with configuration contexts                                   | 712/15     |
| 25 | US 20030158885 A1 |   | Method and apparatus for controlling the processing priority between multiple threads in a multithreaded processor  | 718/108    |
| 26 | US 20030154235 A1 |   | Method and apparatus for controlling the processing priority between multiple threads in a multithreaded processor  | 718/108    |
| 27 | US 20030149964 A1 | ⊠ | Method of executing an interpreter program                                                                          | 717/138    |
| 28 | US 20030145173 A1 | ☒ | Context pipelines                                                                                                   | 711/140    |
| 29 | US 20030135716 A1 | ☒ | Method of creating a high performance virtual multiprocessor by adding a new dimension to a processor's pipeline    | 712/220    |
| 30 | US 20030135711 A1 |   | Apparatus and method for scheduling threads in multi-threading processors                                           | 712/200    |
| 31 | US 20030126403 A1 | ⊠ | Method and apparatus for retiming in a network of multiple context processing elements                              | 712/11     |
| 32 | US 20030120896 A1 | ⊠ | System on chip architecture                                                                                         | 712/32     |
| 33 | US 20030105944 A1 |   | Method and apparatus to quiesce a portion of a simultaneous multithreaded central processing unit                   | 712/220    |
| 34 | US 20030105901 A1 | ☒ | PARALLEL MULTI-THREADED PROCESSING                                                                                  | 710/240    |

stored return addresses comprise return addresses correinstruction from a plurality of stored return addresses. The memory. The microprocessor includes a return prediction system, comprising a microprocessor coupled to a main The present invention further contemplates a computer

the stored return addresses. address corresponding to one of the return instructions from 60 return prediction unit is configured to predict a return instructions in a return stack structure. Furthermore, the ured to store return addresses corresponding to the call branch prediction unit, the return prediction unit is configthe call instructions and the return instructions from the 55 and return instructions. Coupled to receive an indication of taken, wherein branch instructions include call instructions is configured to predict branch instructions taken or not unit and a return prediction unit. The branch prediction unit superscalar microprocessor comprising a branch prediction

Broadly speaking, the present invention contemplates a

mance may be increased by recovering the return prediction important feature of superscalar microprocessors, perfortion. Because mispredicted branch recovery is often an addresses correctly following a mispredicted branch instrucreturn prediction unit may continue to predict return recovers from mispredicted branches. In other words, the misprediction. Advantageously, the return prediction unit contents of the return stack storage with respect to the The results of the comparisons may be used to adjust the prediction unit upon detection of a branch misprediction. tags may be compared to a branch tag conveyed to the return return instructions associated with the return address. These The call tag and return tag respectively identify call and stores a call tag and a return tag with each return address In one embodiment, the return stack storage additionally

the return address. return instruction and execution of the instructions stored at according to the decreased time between execution of the superscalar microprocessors. Performance may be increased quickly than was previously achievable using conventional at the target of the return instruction may be fetched more cessing pipeline of the microprocessor. Instructions residing dicted for return instructions early in the instruction proinstructions. Advantageously, return addresses may be prereturn addresses associated with previously detected call return stack storage is a stack structure configured to store according to a return stack storage included therein. The configured to predict return addresses for return instructions The present microprocessor employs a return prediction unit a microprocessor in accordance with the present invention. The problems outlined above are in large part solved by

## **SOMMARY OF THE INVENTION**

ing the target address of a return instruction is desired. a PUSH instruction, for example. A mechanism for predictstack. This address may be placed on the stack by executing address provided by a CALL instruction is at the top of the 10 are executed when a return address other than a return "Fake return" instructions are return instructions which

tion immediately following the CALL instruction. which causes instruction execution to resume at the instrucand the subroutine typically ends with a return instruction 5 tion can therefore be used to call a subroutine in a program, return instruction as the return address. The CALL instruc-CALL instruction is the address intended to be used by the tion. The instruction address placed on the stack by the

sponding to previously fetched call instructions. The main memory is configured to store instructions for execution by the microprocessor and further configured to store data for manipulation by the microprocessor.

## BRIEF DESCRIPTION OF THE DRAWINGS

Other objects and advantages of the invention will become apparent upon reading the following detailed description and upon reference to the accompanying drawings in which:

FIG. 1 is a block diagram of one embodiment of a superscalar microprocessor including an instruction cache and a branch prediction unit.

FIG. 2 is a block diagram of one embodiment of a return prediction unit which may be included within the instruction cache or the branch prediction unit shown in FIG. 1.

FIG. 2A is a diagram of another embodiment of a return stack storage which may be included in the return prediction unit shown in FIG. 2.

FIG. 3 is a diagram of instructions, including call and return instructions, illustrating instruction flow according to call and return instructions.

FIG. 4A is an exemplary instruction stream used to illustrate the function of the present return prediction unit.

FIG. 4B shows the contents of the return address storage shown in FIG. 2 prior to executing the instruction stream shown in FIG. 4A.

FIG. 4C shows the contents of the return address storage shown in FIG. 2 after the execution of several instructions from the exemplary instruction stream shown in FIG. 4A.

FIG. 4D shows the contents of the return address storage shown in FIG. 2 after the execution of several more instructions from the exemplary instruction stream shown in FIG.

FIG. 4E shows the contents of the return address storage shown in FIG. 2 after completing execution of the exemplary instruction stream shown in FIG. 4A.

FIGS. 2-67 depict a superscalar microprocessor.

While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that the drawings and detailed description thereto are not intended to limit the invention to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope of the present invention as defined by the appended claims.

## DETAILED DESCRIPTION OF THE INVENTION

Turning now to FIG. 1, a block diagram of one embodiment of a superscalar microprocessor 200 in accordance with the present invention is shown. As illustrated in the embodiment of FIG. 1, superscalar microprocessor 200 includes a prefetch/predecode unit 202 and a branch prediction unit 220 coupled to an instruction cache 204. An instruction alignment unit 206 is coupled between instruction cache 204 and a plurality of decode units 208A-208D (referred to collectively as decode units 208). Each decode unit 208A-208D is coupled to respective reservation station units 210A-210D (referred to collectively as reservation stations 210), and each reservation station 210A-210D is coupled to a respective functional unit 212A-212D (referred

|    | Document ID                  | σ | Title                                                                                                                                                                          | Current OR |  |  |
|----|------------------------------|---|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------|--|--|
| 35 | US<br>20030<br>09754<br>8 A1 | Ø | Context execution in pipelined computer processor                                                                                                                              | 712/228 |  |  |
| 36 | US<br>20030<br>09365<br>5 A1 | ⊠ | Multithread embedded processor with input/output capability                                                                                                                    | 712/228 |  |  |
| 37 | US<br>20030<br>08861<br>0 A1 | ⊠ | Multi-core multi-thread processor                                                                                                                                              | 718/107 |  |  |
| 38 | US<br>20030<br>06125<br>8 A1 | ⊠ | Method and apparatus for processing an event occurrence for at least one thread within a multithreaded processor                                                               | 718/102 |  |  |
| 39 | US<br>20030<br>04652<br>1 A1 | ☒ | Apparatus and method for switching threads in multi-threading processors                                                                                                       | 712/228 |  |  |
| 40 | US<br>20030<br>04651<br>7 A1 |   | Apparatus to facilitate multithreading in a computer processor pipeline                                                                                                        | 712/214 |  |  |
| 41 | US<br>20030<br>03880<br>8 A1 | ☒ | Method, apparatus and article of manufacture for a sequencer in a transform/lighting module capable of processing multiple independent execution threads                       | 345/506 |  |  |
| 42 | US<br>20030<br>03722<br>8 A1 | ⊠ | System and method for instruction level multithreading scheduling in a embedded processor                                                                                      |         |  |  |
| 43 | US<br>20030<br>02383<br>5 A1 | ☒ | Method and system to perform a thread switching operation within a multithreaded processor based on dispatch of a quantity of instruction information for a full instruction   | 712/214 |  |  |
| 44 | US<br>20030<br>02383<br>4 A1 | ⊠ | Method and system to insert a flow marker into an instruction stream to indicate a thread switching operation within a multithreaded processor                                 | 712/214 |  |  |
| 45 | US<br>20030<br>02365<br>9 A1 | ⊠ | Method and apparatus for thread switching within a multithreaded processor                                                                                                     | 718/102 |  |  |
| 46 | US<br>20030<br>02365<br>8 A1 | ⊠ | Method and system to perform a thread switching operation within a multithreaded processor based on detection of the absence of a flow of instruction information for a thread | 718/102 |  |  |
| 47 | US<br>20030<br>02072<br>0 A1 | ⊠ | Method, apparatus and article of manufacture for a sequencer in a transform/lighting module capable of processing multiple independent execution threads                       | 345/506 |  |  |
| 48 | US<br>20030<br>01868<br>7 A1 | ☒ | Method and system to perform a thread switching operation within a multithreaded processor based on detection of a flow marker within an instruction information               | 718/102 |  |  |
| 49 | US<br>20030<br>01868<br>6 A1 | ☒ | Method and system to perform a thread switching operation within a multithreaded processor based on detection of a stall condition                                             | 718/102 |  |  |
| 50 | US<br>20030<br>01868<br>5 A1 | ⊠ | Method and system to perform a thread switching operation within a multithreaded processor based on detection of a branch instruction                                          | 718/102 |  |  |
| 51 | US<br>20030<br>01461<br>2 A1 | ⊠ | MULTI-THREADED PROCESSOR BY MULTIPLE-BIT FLIP-FLOP GLOBAL<br>SUBSTITUTION                                                                                                      | 712/215 |  |  |

TABLE 1-continued

| Instr. Byte Number | Start Bit Value | End Bit Value | Functional Bit Value | Meaning                                                                                              |
|--------------------|-----------------|---------------|----------------------|------------------------------------------------------------------------------------------------------|
| 3-8                | 0               | X             | 0                    | Mod R/M or SIB byte                                                                                  |
| 3-8                | 0               | X             | 1                    | Displacement or immediate data; the second functional bit set in bytes 3-8 indicates immediate data |
| 1-8                | X               | 0             | X                    | Not last byte of instruction                                                                         |
| 1-8                | X               | 1             | X                    | Last byte of instruction                                                                             |

As stated previously, in one embodiment certain instructions within the x86 instruction set may be directly decoded by decode unit 208. These instructions are referred to as "fast path" instructions. The remaining instructions of the x86 instruction set are referred to as "MROM instructions". MROM instructions are executed by invoking MROM unit 209. More specifically, when an MROM instruction is encountered, MROM unit 209 converts the instruction into a subset of defined fast path instructions to effectuate a desired operation. A listing of exemplary x86 instructions categorized as fast path instructions as well as a description of the manner of handling both fast path and MROM instructions will be provided further below.

Instruction alignment unit 206 is provided to channel variable byte length instructions from instruction cache 204 to fixed issue positions formed by decode units 208A-208D.

In one embodiment, instruction alignment unit 206 independently and in parallel selects instructions from three groups of instruction bytes provided by instruction cache 204 and arranges these bytes into three groups of preliminary issue positions. Each group of issue positions is associated with one of the three groups of instruction bytes. The preliminary issue positions are then merged together to form the final issue positions, each of which is coupled to one of decode units 208.

Before proceeding with a detailed description of the return address prediction mechanism employed within microprocessor 200, general aspects regarding other subsystems employed within the exemplary superscalar microprocessor 200 of FIG. 1 will be described. For the embodiment of FIG. 1, each of the decode units 208 includes decoding circuitry for decoding the predetermined fast path instructions referred to above. In addition, each decode unit 208A-208D routes displacement and immediate data to a corresponding reservation station unit 210A-210D. Output signals from the decode units 208 include bit-encoded execution instructions for the functional units 212 as well as operand address information, immediate data and/or displacement data.

The superscalar microprocessor of FIG. 1 supports out of order execution, and thus employs reorder buffer 216 to keep track of the original program sequence for register read and write operations, to implement register renaming, to allow for speculative instruction execution and branch misprediction recovery, and to facilitate precise exceptions. As will be appreciated by those of skill in the art, a temporary storage

to collectively as functional units 212). Decode units 208, reservation stations 210, and functional units 212 are further coupled to a reorder buffer 216, a register file 218 and a load/store unit 222. A data cache 224 is finally shown coupled to load/store unit 222, and an MROM unit 209 is shown coupled to instruction alignment unit 206.

Generally speaking, instruction cache 204 is a high speed cache memory provided to temporarily store instructions prior to their dispatch to decode units 208. In one embodiment, instruction cache 204 is configured to cache up to 32 kilobytes of instruction code (each byte consisting of 8 bits). During operation, instruction code is provided to instruction cache 204 by prefetching code from a main memory (not shown) through prefetch/predecode unit 202. It is noted that instruction cache 204 could be implemented in a set-associative, a fully-associative, or a direct-mapped configuration.

Prefetch/predecode unit 202 is provided to prefetch instruction code from the main memory for storage within instruction cache 204. In one embodiment, prefetch/predecode unit 202 is configured to burst 64-bit wide code from the main memory into instruction cache 204. It is understood that a variety of specific code prefetching techniques and algorithms may be employed by prefetch/predecode unit 202.

As prefetch/predecode unit 202 fetches instructions from the main memory, it generates three predecode bits associated with each byte of instruction code: a start bit, an end bit, and a "functional" bit. The predecode bits form tags indicative of the boundaries of each instruction. The predecode tags may also convey additional information such as whether a given instruction can be decoded directly by decode units 208 or whether the instruction must be executed by invoking a microcode procedure controlled by MROM unit 209, as will be described in greater detail below.

Table 1 indicates one encoding of the predecode tags. As indicated within the table, if a given byte is the first byte of an instruction, the start bit for that byte is set. If the byte is the last byte of an instruction, the end bit for that byte is set. If a particular instruction cannot be directly decoded by the decode units 208, the functional bit associated with the first byte of the instruction is set. On the other hand, if the instruction can be directly decoded by the decode units 208, the functional bit associated with the first byte of the instruction is cleared. The functional bit for the second byte of a particular instruction is cleared if the opcode is the first byte, and is set if the opcode is the second byte. It is noted that in situations where the opcode is the second byte, the first byte is a prefix byte. The functional bit values for instruction byte numbers 3-8 indicate whether the byte is a MODRM or an SIB byte, or whether the byte contains displacement or immediate data.

TABLE 1

| Instr. Byte Number | Start Bit Value | End Bit Value | Functional Bit Value | Meaning                                   |
|--------------------|-----------------|---------------|----------------------|-------------------------------------------|
| 1                  | 1               | X             | 0                    | Fast decode                               |
| 1                  | 1               | X             | 1                    | MROM instr.                               |
| 2                  | 0               | X             | 0                    | Opcode is first byte                      |
| 2                  | 0               | X             | 1                    | Opcode is this byte, first byte is prefix |
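The predecode encoding can be illustrated with a short Python sketch that produces a (start, end, functional) tag for each byte of one instruction. The function name `predecode_tags` and its parameters (`fast_path`, `has_prefix`, and `data_start`, the zero-based index of the first displacement or immediate byte) are hypothetical inputs chosen for the example; the actual unit derives this information from the instruction bytes themselves.

```python
def predecode_tags(length, fast_path, has_prefix=False, data_start=None):
    """Return one (start, end, functional) tuple per instruction byte.

    Assumed inputs: length in bytes, whether the instruction is fast path,
    whether byte 1 is a prefix (so the opcode is byte 2), and the zero-based
    index at which displacement/immediate data begins (None if absent).
    """
    tags = []
    for i in range(length):
        start = 1 if i == 0 else 0           # set only on the first byte
        end = 1 if i == length - 1 else 0    # set only on the last byte
        if i == 0:
            functional = 0 if fast_path else 1   # 1 marks an MROM instruction
        elif i == 1:
            functional = 1 if has_prefix else 0  # 1: opcode is this byte
        else:
            is_data = data_start is not None and i >= data_start
            functional = 1 if is_data else 0     # 0: MODRM or SIB byte
        tags.append((start, end, functional))
    return tags


# A 4-byte fast path instruction: opcode, MODRM, then two data bytes.
fast = predecode_tags(4, fast_path=True, data_start=2)
# A 1-byte MROM instruction is both the first and the last byte.
mrom = predecode_tags(1, fast_path=False)
```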

|    | Document ID                  | σ | Title                                                                                                                        | Current OR    |
|----|------------------------------|---|------------------------------------------------------------------------------------------------------------------------------|---------------|
| 52 | US<br>20030<br>01422<br>1 A1 | ☒ | System and method to avoid resource contention in the presence of exceptions                                                 | 702/186       |
| 53 | US<br>20030<br>00964<br>8 A1 | ⊠ | Apparatus for supporting a logically partitioned computer system                                                             | 711/202       |
| 54 | US<br>20030<br>00526<br>6 A1 | ⊠ | Multithreaded processor capable of implicit multithreaded execution of a single-thread program                               | 712/220       |
| 55 | US<br>20030<br>00526<br>3 A1 | ☒ | Shared resource queue for simultaneous multithreaded processing                                                              | 712/218       |
| 56 | US<br>20030<br>00526<br>2 A1 | ☒ | Mechanism for providing high instruction fetch bandwidth in a multi-threaded processor                                       | 712/207       |
| 57 | US<br>20020<br>19917<br>3 A1 | ⊠ | System, method and article of manufacture for a debugger capable of operating across multiple threads and lock domains       | 717/129       |
| 58 | US<br>20020<br>18883<br>2 A1 | ☒ | Method and apparatus for providing local control of processing elements in a network of multiple context processing elements | 712/228       |
| 59 | US<br>20020<br>15699<br>9 A1 | ⊠ | Mixed-mode hardware multithreading                                                                                           | 712/228       |
| 60 | US<br>20020<br>13871<br>7 A1 | ☒ | Multiple-thread processor with single-thread interface shared among threads                                                  | 712/235       |
| 61 | US<br>20020<br>12922<br>7 A1 | ⊠ | Processor having priority changing function according to threads                                                             | 712/228       |
| 62 | US<br>20020<br>11660<br>0 A1 | ⊠ | Method and apparatus for processing events in a multithreaded processor                                                      | 712/218       |
| 63 | US<br>20020<br>11452<br>9 A1 |   | Arithmetic coding apparatus and image processing apparatus                                                                   | 382/247       |
| 64 | US<br>20020<br>10399<br>0 A1 |   | Programmed load precession machine                                                                                           | 712/215       |
| 65 | US<br>20020<br>10384<br>7 A1 |   | Efficient mechanism for inter-thread communication within a multi-threaded computer system                                   | 718/107       |
| 66 | US<br>20020<br>09561<br>4 A1 |   | Method and apparatus for disabling a clock signal within a multithreaded processor                                           | 713/500       |
| 67 | US<br>20020<br>09191<br>5 A1 |   | Load prediction and thread identification in a multithreaded microprocessor                                                  | 712/225       |
| 68 | US<br>20020<br>08784<br>4 A1 |   | Apparatus and method for concealing switch latency                                                                           | 712/228       |

wood Cliffs, N.J., 1991, and within the co-pending, commonly assigned patent application entitled "High Performance Superscalar Microprocessor", Ser. No. 08/146,382, filed Oct. 29, 1993 by Witt, et al., abandoned and continued in application Ser. No. 501,243 filed Jul. 10, 1995, now U.S. Pat. No. 5,651,125. These documents are incorporated herein by reference in their entirety.

Reservation station units 210A-210D are provided to temporarily store instruction information to be speculatively executed by the corresponding functional units 212A-212D. As stated previously, each reservation station unit 210A-210D may store instruction information for up to three pending instructions. Each of the four reservation stations 210A-210D contain locations to store bit-encoded execution instructions to be speculatively executed by the corresponding functional unit and the values of operands. If a particular operand is not available, a tag for that operand is provided from reorder buffer 216 and is stored within the corresponding reservation station until the result has been generated (i.e., by completion of the execution of a previous instruction). It is noted that when an instruction is executed by one of the functional units 212A-212D, the result of that instruction is passed directly to any reservation station units 210A-210D that are waiting for that result at the same time the result is passed to update reorder buffer 216 (this technique is commonly referred to as "result forwarding"). Instructions are issued to functional units for execution after the values of any required operand(s) are made available. That is, if an operand associated with a pending instruction within one of the reservation station units 210A-210D has been tagged with a location of a previous result value within reorder buffer 216 which corresponds to an instruction which modifies the required operand, the instruction is not issued to the corresponding functional unit 212 until the operand result for the previous instruction has been obtained. Accordingly, the order in which instructions are executed may not be the same as the order of the original program instruction sequence. Reorder buffer 216 ensures that data coherency is maintained in situations where read-after-write dependencies occur.
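Result forwarding and operand-driven issue can be sketched in a few lines of Python. The class `ReservationStation`, its method names, and the representation of an operand as a `("value", v)` or `("tag", t)` pair are assumptions made for illustration; only the capacity of three pending instructions comes from the text.

```python
class ReservationStation:
    """Toy reservation station holding up to three pending instructions."""

    CAPACITY = 3  # "up to three pending instructions" per the description

    def __init__(self):
        self.pending = []

    def add(self, op, operands):
        """Store an instruction; each operand is ("value", v) or ("tag", t)."""
        if len(self.pending) >= self.CAPACITY:
            raise RuntimeError("reservation station full")
        self.pending.append({"op": op, "operands": list(operands)})

    def forward(self, tag, value):
        """Result forwarding: fill in every operand waiting on this tag."""
        for entry in self.pending:
            entry["operands"] = [
                ("value", value) if operand == ("tag", tag) else operand
                for operand in entry["operands"]
            ]

    def issue_ready(self):
        """Remove and return the instructions whose operands are all values."""
        ready, waiting = [], []
        for entry in self.pending:
            is_ready = all(kind == "value" for kind, _ in entry["operands"])
            (ready if is_ready else waiting).append(entry)
        self.pending = waiting
        return ready


station = ReservationStation()
station.add("add", [("value", 1), ("tag", 3)])    # waits on reorder buffer tag 3
station.add("sub", [("value", 7), ("value", 2)])  # operands already available
first = station.issue_ready()    # the younger "sub" issues ahead of "add"
station.forward(3, 41)           # the result for tag 3 is broadcast
second = station.issue_ready()   # "add" can now issue
```

The younger instruction issuing first is the out-of-order behavior the paragraph describes: issue order follows operand availability rather than program order.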

In one embodiment, each of the functional units 212 is configured to perform integer arithmetic operations of addition and subtraction, as well as shifts, rotates, logical operations, and branch operations. It is noted that a floating point unit may also be employed to accommodate floating point operations.

Each of the functional units 212 also provides information regarding the execution of conditional branch instructions to the branch prediction unit 220. If a branch prediction was incorrect, branch prediction unit 220 flushes instructions subsequent to the mispredicted branch that have entered the instruction processing pipeline, and causes prefetch/predecode unit 202 to fetch the required instructions from instruction cache 204. It is noted that in such situations, results of instructions in the original program sequence which occur after the mispredicted branch instruction are discarded, including those which were speculatively executed and temporarily stored in load/store unit 222 and reorder buffer 216. Exemplary configurations of suitable branch prediction mechanisms are well known.

Results produced by functional units 212 are sent to the reorder buffer 216 if a register value is being updated, and to the load/store unit 222 if the contents of a memory location is changed. If the result is to be stored in a register, the reorder buffer 216 stores the result in the location reserved for the value of the register when the instruction was decoded. As stated previously, results are also broadcast

location within reorder buffer 216 is reserved upon decode of an instruction that involves the update of a register to thereby store speculative register states. Reorder buffer 216 may be implemented in a first-in-first-out configuration wherein speculative results move to the "top" of the buffer as they are validated and written to the register file. Other specific configurations of reorder buffer 216 are also possible. If a branch prediction is incorrect, the results of speculatively-executed instructions along the mispredicted path can be invalidated in the buffer before they are written to register file 218.

The bit-encoded execution instructions and immediate data provided at the outputs of decode units 208A-208D are routed directly to respective reservation station units 210A-210D. In one embodiment, each reservation station unit 210A-210D is capable of holding instruction information (i.e., bit encoded execution bits as well as operand values, operand tags and/or immediate data) for up to three pending instructions awaiting issue to the corresponding functional unit. It is noted that for the embodiment of FIG. 1, each decode unit 208A-208D is associated with a dedicated reservation station unit 210A-210D, and that each reservation station unit 210A-210D is similarly associated with a dedicated functional unit 212A-212D. Accordingly, four dedicated "issue positions" are formed by decode units 208, reservation station units 210 and functional units 212. Instructions aligned and dispatched to issue position 0 through decode unit 208A are passed to reservation station unit 210A and subsequently to functional unit 212A for execution. Similarly, instructions aligned and dispatched to decode unit 208B are passed to reservation station unit 210B and into functional unit 212B, and so on.

Upon decode of a particular instruction, if a required operand is a register location, register address information is routed to reorder buffer 216 and register file 218 simultaneously. Those of skill in the art will appreciate that the x86 register file includes eight 32 bit real registers (i.e., typically referred to as EAX, EBX, ECX, EDX, EBP, ESI, EDI and ESP). Reorder buffer 216 contains temporary storage locations for results which change the contents of these registers to thereby allow out of order execution. A temporary storage location of reorder buffer 216 is reserved for each instruction which, upon decode, is determined to modify the contents of one of the real registers. Therefore, at various points during execution of a particular program, reorder buffer 216 may have one or more locations which contain the speculatively executed contents of a given register. If following decode of a given instruction it is determined that reorder buffer 216 has a previous location or locations assigned to a register used as an operand in the given instruction, the reorder buffer 216 forwards to the corresponding reservation station either: 1) the value in the most recently assigned location, or 2) a tag for the most recently assigned location if the value has not yet been produced by the functional unit that will eventually execute the previous instruction. If the reorder buffer has a location reserved for a given register, the operand value (or tag) is provided from reorder buffer 216 rather than from register file 218. If there is no location reserved for a required register in reorder buffer 216, the value is taken directly from register file 218. If the operand corresponds to a memory location, the operand value is provided to the reservation station unit through load/store unit 222.
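The register operand lookup described above reduces to a search from the most recently assigned reorder buffer location backwards. The following Python is a minimal sketch under stated assumptions: the function name `read_operand`, the list-of-dicts layout (oldest entry first), and the use of the list index as the tag are all hypothetical.

```python
def read_operand(register, reorder_buffer, register_file):
    """Return ("value", v) or ("tag", t) for a register operand.

    reorder_buffer is assumed to be a list of {"dest", "value"} dicts in
    allocation order; "value" is None until the result has been produced.
    """
    # Search from the most recently assigned location to the oldest.
    for tag in range(len(reorder_buffer) - 1, -1, -1):
        entry = reorder_buffer[tag]
        if entry["dest"] == register:
            if entry["value"] is None:
                return ("tag", tag)            # result not yet produced
            return ("value", entry["value"])   # speculative value available
    # No location reserved for this register: use the register file.
    return ("value", register_file[register])


register_file = {"EAX": 10, "EBX": 20}
reorder_buffer = [
    {"dest": "EAX", "value": 11},    # older write, result already produced
    {"dest": "EAX", "value": None},  # most recent write, still executing
]
eax = read_operand("EAX", reorder_buffer, register_file)
ebx = read_operand("EBX", reorder_buffer, register_file)
```

Here `EAX` resolves to a tag because its most recent writer has not finished, while `EBX` has no reserved location and is read from the register file.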

Details regarding suitable reorder buffer implementations may be found within the publication "Superscalar Microprocessor Design" by Mike Johnson, Prentice-Hall, Engle-

|    | Document ID                  | ט | Title                                                                                                                                         | Current OR  |
|----|------------------------------|---|-----------------------------------------------------------------------------------------------------------------------------------------------|-------------|
| 69 | US<br>20020<br>08784<br>3 A1 |   | Method and apparatus for reducing components necessary for instruction pointer generation in a simultaneous multithreaded processor           | 712/228     |
| 70 | US<br>20020<br>08784<br>0 A1 |   | Method for converting pipeline stalls to pipeline flushes in a multithreaded processor                                                        | 712/219     |
| 71 | US<br>20020<br>08783<br>5 A1 |   | Method and apparatus for improving dispersal performance in a processor through the use of no-op ports                                        | 712/215     |
| 72 | US<br>20020<br>08337<br>3 A1 |   | Journaling for parallel hardware threads in multithreaded processor                                                                           | 714/38      |
| 73 | US<br>20020<br>07812<br>2 A1 |   | Switching method in a multi-threaded processor                                                                                                | 718/102     |
| 74 | US<br>20020<br>05603<br>7 A1 |   | Method and apparatus for providing large register address space while maximizing cycletime performance for a multi-threaded register file set | 712/215     |
| 75 | US<br>20020<br>05459<br>4 A1 |   | Non-blocking, multi-context pipelined processor                                                                                               | 370/389     |
| 76 | US<br>20020<br>04632<br>5 A1 |   | Buffer memory management in a system having multiple execution entities                                                                       | 711/122     |
| 77 | US<br>20020<br>03841<br>6 A1 |   | System and method for reading and writing a thread state in a multithreaded central processing unit                                           | 712/228     |
| 78 | US<br>20020<br>01386<br>1 A1 |   | Method and apparatus for low overhead multithreaded communication in a parallel processing environment                                        | 719/313     |
| 79 | US<br>20020<br>00266<br>7 A1 |   | System and method for instruction level multithreading in an embedded processor using zero-time context switching                             | 712/228     |
| 80 | US<br>20010<br>05645<br>6 A1 |   | PRIORITY BASED SIMULTANEOUS MULTI-THREADING                                                                                                   | 718/103     |
| 81 | US<br>20010<br>04977<br>0 A1 |   | BUFFER MEMORY MANAGEMENT IN A SYSTEM HAVING MULTIPLE EXECUTION ENTITIES                                                                       | 711/129     |
| 82 | US<br>20010<br>04746<br>8 A1 |   | Branch and return on blocked load or store                                                                                                    | 712/228     |
| 83 | US<br>20010<br>03744<br>5 A1 |   | Cycle count replication in a simultaneous and redundantly threaded processor                                                                  | 712/216     |
| 84 | US<br>20010<br>02951<br>5 A1 |   | Method and apparatus for configuring arbitrary sized data paths comprising multiple context processing elements                               | 708/232     |
| 85 | US<br>20010<br>00475<br>5 A1 |   | MECHANISM FOR FREEING REGISTERS ON PROCESSORS THAT PERFORM<br>DYNAMIC OUT-OF-ORDER EXECUTION OF INSTRUCTIONS USING RENAMING<br>REGISTERS      | 712/217     |

valid bit further indicates that the return tag is valid. The CRV valid bit indicates, when set, that the call instruction associated with this entry has been retired by reorder buffer

The call and return tags are branch tags associated with the respective call and return instruction represented by a particular entry. It is noted that branch prediction unit 220 assigns a branch tag to each branch, call, and return instruction. A branch tag is a number indicative of the order of a particular predicted branch instruction with respect to other predicted branch instructions. The branch tags associated with two such instructions may be compared to determine which of the two instructions is first in program order. In one embodiment, the branch tag associated with a particular instruction is conveyed with each instruction through the instruction processing pipelines of microprocessor 200. Each non-branch instruction receives the branch tag associated with the most recently predicted branch when that instruction is dispatched. In one specific embodiment, the branch tag comprises four bits. If all branch tags are assigned to outstanding branch instructions, and another branch instruction is detected, then the instruction processing pipeline of microprocessor 200 is stalled until a branch tag becomes available.
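A rough Python model of branch tag assignment follows. It assumes a pool of sixteen tags (four bits), hands out a tag per detected branch, signals a stall by returning `None` when no tag is free, and frees a tag when the oldest branch retires. The class `BranchTagPool` and the list-position ordering check in `is_older` are illustrative assumptions; the text does not say how the hardware compares tags.

```python
class BranchTagPool:
    """Toy allocator for branch tags (hypothetical structure)."""

    def __init__(self, bits=4):
        self.free = list(range(2 ** bits))  # a 4-bit tag gives 16 values
        self.outstanding = []               # assigned tags, oldest first

    def allocate(self):
        """Assign a tag to a newly detected branch; None means stall."""
        if not self.free:
            return None
        tag = self.free.pop(0)
        self.outstanding.append(tag)
        return tag

    def retire_oldest(self):
        """Retiring the oldest branch makes its tag available again."""
        tag = self.outstanding.pop(0)
        self.free.append(tag)
        return tag

    def is_older(self, a, b):
        """Assumed ordering check: earlier in the outstanding list is older."""
        return self.outstanding.index(a) < self.outstanding.index(b)


pool = BranchTagPool()
tags = [pool.allocate() for _ in range(16)]  # uses every available tag
stalled = pool.allocate()                    # no tag left: pipeline stalls
pool.retire_oldest()                         # oldest branch retires
resumed = pool.allocate()                    # the freed tag is reused
```

Because the reused tag 0 is now the youngest outstanding branch, numeric comparison alone would order it incorrectly; this is why the sketch compares positions rather than tag values.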

A branch tag becomes available in a number of ways. If a branch instruction is detected as mispredicted, the branch tag associated with that instruction and subsequent branch tags become available. Subsequent branch tags become available because instructions subsequent to a mispredicted branch are flushed (or deleted) from the instruction processing pipeline. If a branch instruction is retired by reorder buffer 216, then the branch tag associated with that instruction becomes available. As used herein, the term "mispredicted branch" refers to a branch, call, or return instruction for which the target address has been mispredicted. A branch may be mispredicted because the speculatively generated target address is incorrect. Additionally, a branch may be mispredicted because it was predicted to be taken (i.e. the next instruction to be executed resides at the target address) and the branch is found to be not taken. Alternatively, a branch may be mispredicted because it was predicted to be not taken (i.e. the next instruction to be executed resides in memory contiguous to the branch instruction) and the branch is found to be taken.

Return prediction unit 250 further includes a return stack control unit 254, an adder circuit 256, a multiplexor 258, and a comparator block 260. Return stack control unit 254 is configured to control the storage of data within return stack storage 252. Adder circuit 256 and multiplexor 258 are configured to generate a return PC for a particular entry during a clock cycle in which the entry is being allocated to a particular call instruction. Comparator block 260 is used to recover from mispredicted branch instructions.

Return stack control unit 254 is coupled to return stack storage 252 via a data bus 262 and a pointer bus 264. Data bus 262 allows the reading and writing of data into each of the storage locations (or entries) within return stack storage 252. Control signals conveyed along with the data upon data bus 262 indicate a read or write operation as well as which entry or entries is selected. Pointer bus 264 conveys a pointer indicative of the "top" of return stack storage 252. The top of return stack storage 252 is the entry within return stack storage 252 which contains the most recently allocated call instruction data. The entry above the entry indicated by return pointer 264 does not contain valid data.

plurality of buses from other units within microprocessor Additionally, return stack control unit 254 receives a

to reservation station units 210A-210D where pending instructions may be waiting for the results of previous instruction executions to obtain the required operand values.

Generally speaking, load/store unit 222 provides an interface between functional units 212A-212D and data cache 224. In one embodiment, load/store unit 222 is configured with a load/store buffer with eight storage locations for data and address information for pending loads or stores. Decode units 208 arbitrate for access to the load/store unit 222. When the buffer is full, a decode unit must wait until the load/store unit 222 has room for the pending load or store request information. The load/store unit 222 also performs dependency checking for load instructions against pending store instructions to ensure that data coherency is maintained.

Data cache 224 is a high speed cache memory provided to temporarily store data being transferred between load/store unit 222 and the main memory subsystem. In one embodiment, data cache 224 has a capacity of storing up to sixteen kilobytes of data. It is understood that data cache 224 may be implemented in a variety of specific memory configurations, including a set associative configuration.

Turning now to FIG. 2, one embodiment of a return prediction unit 250 is shown. In one embodiment of microprocessor 200, return prediction unit 250 is included within instruction cache 204. In another embodiment of microprocessor 200, return prediction unit 250 is included within branch prediction unit 220. Generally speaking, return prediction unit 250 is configured to provide return address predictions for return instructions. In embodiments of microprocessor 200 employing the x86 microprocessor architecture, return instructions include the RET instruction and the IRET instruction. Return prediction unit 250 stacks return addresses associated with call instructions within a return stack storage 252, and predicts the return address associated with a return instruction based on the recorded return addresses. Additionally, return prediction unit 250 is configured to recover from mispredicted branches and call instructions for which the target is mispredicted. Following detection of the mispredicted instruction, return stack storage 252 accurately reflects call and return instructions detected prior to the mispredicted instruction and has discarded call and return instructions subsequent to the mispredicted instruction. When a return instruction is mispredicted by return prediction unit 250 (e.g. a fake return has been encountered), entries within return stack storage 252 are invalidated. Advantageously, return addresses may be predicted early in the instruction processing pipeline of microprocessor 200. It is noted that in embodiments of microprocessor 200 employing the x86 microprocessor architecture, call instructions include the CALL instruction and the INT instruction.

In one embodiment, return stack storage 252 includes several fields within each entry (or storage location). It is noted that return stack storage 252 may employ multiple registers as its storage locations, or the storage locations may be rows of a storage array. In one embodiment, return stack storage 252 includes eight entries. Each entry includes a return program counter (or return PC), a call tag, a return tag, and several valid bits. The return PC is the return address associated with the call instruction represented by the return stack entry. The CV valid bit indicates, when set, that a call instruction has been detected. The CV valid bit serves to indicate that the entry includes valid information. For example, if the CV bit is set, then the return PC and call tag are valid. The RV valid bit indicates, when set, that the entry has been used as a prediction for a return instruction. The RV

|     | Document<br>ID       | U | Title                                                                                                                           | Current<br>OR |
|-----|----------------------|---|---------------------------------------------------------------------------------------------------------------------------------|---------------|
| 86  | US<br>67218<br>73 B2 |   | Method and apparatus for improving dispersal performance in a processor through the use of no-op ports                          | 712/215       |
| 87  | US<br>66979<br>35 B1 |   | Method and apparatus for selecting thread switch events in a multithreaded processor                                            | 712/228       |
| 88  | US<br>66944<br>25 B1 |   | Selective flush of shared and other pipeline stages in a multithread processor                                                  | 712/216       |
| 89  | US<br>66943<br>47 B2 |   | Switching method in a multi-threaded processor                                                                                  | 718/108       |
| 90  | US<br>66752<br>85 B1 |   | Geometric engine including a computational module without memory contention                                                     | 712/201       |
| 91  | US<br>66751<br>92 B2 |   | Temporary halting of thread execution until monitoring of armed events to memory location identified in working registers       | 718/107       |
| 92  | US<br>66718<br>27 B2 |   | Journaling for parallel hardware threads in multithreaded processor                                                             | 714/38        |
| 93  | US<br>66683<br>17 B1 |   | Microengine for parallel processor architecture                                                                                 | 712/245       |
| 94  | US<br>66586<br>55 B1 |   | Method of executing an interpreter program                                                                                      | 717/139       |
| 95  | US<br>66585<br>51 B1 |   | Method and apparatus for identifying splittable packets in a multithreaded VLIW processor                                       | 712/24        |
| 96  | US<br>66584<br>47 B2 |   | Priority based simultaneous multi-threading                                                                                     | 718/103       |
| 97  | US<br>66503<br>30 B2 |   | Graphics system and method for processing multiple independent execution threads                                                | 345/506       |
| 98  | US<br>66402<br>99 B1 |   | Method and apparatus for arbitrating access to a computational engine for use in a video graphics controller                    | 712/245       |
| 99  | US<br>66309<br>35 B1 |   | Geometric engine including a computational module for use in a video graphics controller                                        | 345/522       |
| 100 | US<br>66292<br>36 B1 |   | Master-slave latch circuit for multithreaded processing                                                                         | 712/228       |
| 101 | US<br>66256<br>54 B1 |   | Thread signaling in multi-threaded network processor                                                                            | 709/230       |
| 102 | US<br>66248<br>18 B1 |   | Method and apparatus for shared microcode in a multi-thread computation engine                                                  | 345/522       |
| 103 | US<br>66112<br>76 B1 |   | Graphical user interface that displays operation of processor threads over time                                                 | 345/772       |
| 104 | US<br>66067<br>04 B1 |   | Parallel multithreaded processor with plural microengines executing multiple threads each microengine having loadable microcode | 712/248       |
| 105 | US<br>65947<br>55 B1 |   | System and method for interleaved execution of multiple independent threads                                                     | 712/239       |
| 106 | US<br>65913<br>57 B2 |   | Method and apparatus for configuring arbitrary sized data paths comprising multiple context processing elements                 | 712/18        |
| 107 | US<br>65879<br>06 B2 |   | Parallel multi-threaded processing                                                                                              | 710/240       |
| 108 | US<br>65781<br>37 B2 |   | Branch and return on blocked load or store                                                                                      | 712/228       |

through the instruction processing pipeline of microprocessor 200. When a decode unit 208 decodes a call or return instruction and the associated indication signifies that the instruction was not detected by branch prediction unit 220, then the decode unit asserts a call signal upon a decode call bus 272 or a decode return bus 274 to return prediction unit 250. During a clock cycle in which these signals are asserted, the instruction fetching portion of the instruction processing pipeline within instruction cache 204 and instruction alignment unit 206 is stalled.

When decode units 208 decode a call or return instruction not detected by branch prediction unit 220, return stack control unit 254 performs two actions concurrently. First, the call or return instruction is treated similar to a mispredicted branch (described below) having a branch tag equal to the branch tag carried by the instruction. It is noted that this branch tag is actually associated with a predicted branch instruction prior to the instruction in this case, since the call or return instruction was not detected during the fetch stage of the instruction processing pipeline. The branch tag is conveyed to comparator block 260 upon a decode branch tag bus 276. Additionally, the call or return instruction is allocated a storage location similar to the above discussion for call and return instructions detected by branch prediction unit 220. The call or return tag in this case is one greater than the branch tag conveyed by decode units 208. Additionally, multiplexor 258 is directed to accept the PC address from decode units 208 in the case of a call instruction. The instruction offset is calculated according to the particular decode unit 208 which decodes the call or return instruction. The offset is conveyed by that decode unit to return stack control unit 254.

During a clock cycle in which reorder buffer 216 retires a call or return instruction, the associated call or return tag is transferred to return stack control unit 254 upon retire bus 278. Upon receipt of a retired call instruction indication, the CRV bit is set in the associated storage location (identified by a call tag equal to the tag sent by reorder buffer 216). Upon receipt of a retired return instruction indication, the associated storage location (identified by a return tag equal to the tag sent by reorder buffer 216) is deleted from return stack storage 252. In one embodiment, the contents of storage locations within return stack 252 between the storage location indicated by pointer bus 264 and the storage location identified by the return tag are copied to the next lower storage location. In other words, storage locations between the storage location indicated by pointer bus 264 and the storage location identified by the return tag are shifted down by one location. The storage location identified by the return tag is thereby overwritten by another entry and therefore is deleted from return stack storage 252.
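The retirement handling described above can be sketched in software as follows. This is a minimal model, not the hardware itself: the `Entry` class, the list-based stack (index 0 as the bottom), and the function names are hypothetical, and the sketch assumes the pointer is decremented after the shift.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    return_pc: int
    call_tag: int
    return_tag: Optional[int] = None  # assumed unset until a return uses the entry
    crv: bool = False                 # set once the call instruction has retired


def retire_call(entries: List[Entry], top: int, tag: int) -> None:
    """Set the CRV bit in the entry whose call tag equals the retired tag."""
    for entry in entries[: top + 1]:
        if entry.call_tag == tag:
            entry.crv = True


def retire_return(entries: List[Entry], top: int, tag: int) -> int:
    """Delete the entry whose return tag equals the retired tag.

    Entries between that location and the top are copied one location
    down, overwriting the deleted entry. Returns the new top index
    (assumption: the pointer moves down by one).
    """
    for k in range(top + 1):
        if entries[k].return_tag == tag:
            for i in range(k, top):
                entries[i] = entries[i + 1]
            return top - 1
    return top  # no matching entry; nothing to delete
```

In this model the location above the returned top index still holds stale data, which matches the statement that entries above the pointer do not contain valid data.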

Because microprocessor 200 is configured to speculatively execute instructions out-of-order, branch mispredictions may indicate that portions of return stack 252 are storing incorrect information. When a branch instruction is mispredicted, then information associated with instructions within the instruction processing pipeline which are subsequent to the mispredicted branch may be incorrect. As noted above, the instructions subsequent to the mispredicted branch are flushed from the pipeline. Return stack storage 252 is purged of information related to the flushed instructions.

In order to recover the contents of return stack storage 252 upon detection of a mispredicted branch, comparator block 260 is included within return prediction unit 250. A misprediction signal upon branch misprediction conductor 278 indicates that a mispredicted branch has been detected.

A call bus 266 and a return bus 268 convey call and return signals from branch prediction unit 220. The call and return signals are indicative, when asserted, of a call and return instruction (respectively) detected by branch prediction unit 220. In one embodiment, branch prediction unit 220 employs a branch prediction structure similar to that described within the commonly assigned, co-pending patent application entitled: "A Way Prediction Unit and Method for Operating Same", Ser. No. 08/420,666, filed Apr. 12, 1995 by Tran, et al., abandoned and continued in application Ser. No. 08/838,680 filed Apr. 9, 1997. This patent application is incorporated herein by reference in its entirety. Call and return instructions are stored as predicted branches within the branch prediction structure, along with an indication that the instruction is a call or return instruction. The branch prediction structure includes a plurality of storage locations indexed by the instruction fetch address. Branches and call instructions are detected according to information stored in each entry and predicted according to that information. If a call instruction is detected within a set of instructions fetched by instruction cache 204, then the call signal is asserted to return prediction unit 250. Similarly, if a return instruction is detected within a set of instructions fetched by instruction cache 204, then the return signal is asserted to return prediction unit 250.

Upon receipt of the asserted call signal from branch prediction unit 220, return stack control unit 254 allocates an entry to the call instruction. The allocated entry is determined according to the pointer on pointer bus 264, and the entry becomes the top of the stack (i.e. the pointer indicates that the allocated entry is the top of the stack). The return PC is calculated by adder circuit 256 from the offset of the call instruction within the fetched line (the offset is transferred to return stack control unit 254 from branch prediction unit 220 upon call bus 266) and from the address being fetched (transferred to multiplexor 258 from instruction cache 204). Multiplexor 258 is controlled by return stack control unit 254, and selects the address conveyed by instruction cache 204 in this case. The return PC is stored into the allocated entry within return stack storage 252, along with the call tag assigned by branch prediction unit 220 (conveyed upon call bus 266). Additionally, the CV bit is set.
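The allocation step can be illustrated with a short software sketch. The class and method names are hypothetical, and the sketch assumes that the offset supplied with the call signal is measured to the byte following the call instruction, so that fetch address plus offset gives the return address. It also assumes a free entry exists.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    return_pc: int = 0
    call_tag: int = 0
    cv: bool = False  # CV bit: entry holds a valid return PC and call tag
    rv: bool = False  # RV bit: entry already used as a return prediction


class ReturnStack:
    def __init__(self, size: int = 8):
        # Eight entries, as in the one embodiment described in the text.
        self.entries = [Entry() for _ in range(size)]
        self.top = -1  # index of the most recently allocated entry; -1 means empty

    def allocate_call(self, fetch_address: int, call_offset: int, call_tag: int) -> int:
        """Allocate the next entry to a detected call and return its return PC."""
        if self.top + 1 >= len(self.entries):
            raise OverflowError("sketch does not model a full return stack")
        # Adder: fetch address plus offset (assumed to point past the call).
        return_pc = fetch_address + call_offset
        self.top += 1  # the allocated entry becomes the top of the stack
        self.entries[self.top] = Entry(return_pc, call_tag, cv=True)
        return return_pc
```

For example, `ReturnStack().allocate_call(0x1000, 0x7, 3)` stores return PC `0x1007` with call tag 3 in entry 0 and sets its CV bit.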

Upon receipt of the asserted return signal from branch prediction unit 220 (along with an associated return tag), return stack control unit 254 predicts a return address for the return instruction. The return address is predicted to be the return PC stored within the entry nearest to the top of the return stack (as indicated by the pointer upon pointer bus 264) for which the RV bit is not yet set. The RV bit is indicative that the associated return PC has been used as a return address prediction during a previous clock cycle. Therefore, the return PC is already associated with a previous return instruction if the RV bit is set. The selected return PC is conveyed upon return PC prediction bus 270 to instruction cache 204 for use in fetching subsequent instructions. It is noted that if no storage locations within return stack storage 252 meet the above mentioned criteria, then no prediction is made for that return instruction. It is further noted that, in one embodiment, either a call or a return instruction is indicated by branch prediction unit 220 upon call bus 266 and return bus 268 during a given clock cycle.

It is noted that the branch prediction structure within branch prediction unit 220 is a speculative structure which may not store an indication of a particular call or return instruction. An indication of whether or not a particular call or return instruction was detected by branch prediction unit 220 is conveyed with each call and return instruction

|     | Document<br>ID       | U | Title                                                                                                                                                    | Current<br>OR |
|-----|----------------------|---|----------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|
| 109 | US<br>65739<br>00 B1 |   | Method, apparatus and article of manufacture for a sequencer in a transform/lighting module capable of processing multiple independent execution threads | 345/537       |
| 110 | US<br>65678<br>39 B1 |   | Thread switch control in a multithreaded processor system                                                                                                | 718/103       |
| 111 | US<br>65670<br>84 B1 |   | Lighting effect computation circuit and method therefore                                                                                                 | 345/426       |
| 112 | US<br>65534<br>79 B2 |   | Local control of multiple context processing elements with major contexts and minor contexts                                                             | 712/16        |
| 113 | US<br>65429<br>91 B1 |   | Multiple-thread processor with single-thread interface shared among threads                                                                              | 712/228       |
| 114 | US<br>65429<br>21 B1 |   | Method and apparatus for controlling the processing priority between multiple threads in a multithreaded processor                                       | 718/108       |
| 115 | US<br>65359<br>05 B1 |   | Method and apparatus for thread switching within a multithreaded processor                                                                               | 718/108       |
| 116 | US<br>65325<br>09 B1 |   | Arbitrating command requests in a parallel multi-threaded processing system                                                                              | 710/240       |
| 117 | US<br>65264<br>98 B1 |   | Method and apparatus for retiming in a network of multiple context processing elements                                                                   | 712/11        |
| 118 | US<br>65078<br>62 B1 |   | Switching method in a multi-threaded processor                                                                                                           | 718/107       |
| 119 | US<br>64969<br>25 B1 |   | Method and apparatus for processing an event occurrence within a multithreaded processor                                                                 | 712/244       |
| 120 | US<br>64937<br>41 B1 |   | Method and apparatus to quiesce a portion of a simultaneous multithreaded central processing unit                                                        | 718/107       |
| 121 | US<br>64704<br>43 B1 |   | Pipelined multi-thread processor selecting thread instruction in inter-stage buffer based on count information                                           | 712/205       |
| 122 | US<br>64704<br>22 B2 |   | Buffer memory management in a system having multiple execution entities                                                                                  | 711/129       |
| 123 | US<br>64571<br>16 B1 |   | Method and apparatus for controlling contexts of multiple context processing elements in a network of multiple context processing elements               | 712/16        |
| 124 | US<br>64386<br>71 B1 |   | Generating partition corresponding real address in partitioned mode supporting system                                                                    | 711/173       |
| 125 | US<br>63780<br>65 B1 |   | Apparatus with context switching capability                                                                                                              | 712/228       |
| 126 | US<br>63742<br>86 B1 |   | Real time processor capable of concurrently running multiple independent JAVA machines                                                                   | 718/108       |
| 127 | US<br>63634<br>75 B1 |   | Apparatus and method for program level parallelism in a VLIW processor                                                                                   | 712/206       |
| 128 | US<br>63570<br>16 B1 |   | Method and apparatus for disabling a clock signal within a multithreaded processor                                                                       | 713/601       |
| 129 | US<br>63518<br>08 B1 |   | Vertically and horizontally threaded processor with multidimensional storage for storing thread data                                                     | 712/228       |
| 130 | US<br>63493<br>63 B1 |   | Multi-section cache with different attributes for each section                                                                                           | 711/129       |
| 131 | US<br>63413<br>47 B1 |   | Thread switch logic in a multiple-thread processor                                                                                                       | 712/228       |

An exception causes program execution to jump to an exception handling routine, similar to an interrupt. If an instruction generates an exception, the branch tag conveyed with the instruction is used as a mispredicted branch tag, and the return stack is recovered according to the mispredicted branch recovery sequence noted above. It is further noted that, when multiple call and/or return instructions are detected simultaneously by decode units 208, return stack control unit 254 is configured to select the first instruction detected in program order. The other call and/or return instructions will be purged from the instruction processing pipeline due to the first instructions being detected. It is still further noted that the above discussion describes signals as being "asserted". A signal may be defined as being asserted when it conveys a value indicative of a particular piece of information. A particular signal may be defined to be asserted when it conveys a binary one value or, alternatively, when it conveys a binary zero value. A second embodiment of return prediction unit 250 is contemplated in which call and return instructions are detected in decode units 208 but not in branch prediction unit 220. For this embodiment, functionality is similar to the above description with the exception of detection of call and return instructions in branch prediction unit 220. Additionally, an embodiment is contemplated in which call, return and branch tags may be tags from reorder buffer 216 indicative of the position within reorder buffer 216 storing the associated instruction.

Turning now to FIG. 2A, another embodiment of return stack storage 252 (return stack storage 252A) is shown. The return PC, call tag, and return tag are included, as well as the CV, RV, and CRV bits similar to return stack storage 252 shown in FIG. 2. Additionally, return stack storage 252A includes an IXC bit and an ISTART bit. The IXC bit is set if the call tag is associated with an INT instruction, and is cleared if the call tag is associated with a CALL instruction. When a prediction is made by return prediction unit 250 for an IRET instruction, the storage location nearest the top of return stack storage 252A in which the RV bit is clear and the IXC bit is set is used as the prediction. Similarly, when a prediction is made by return prediction unit 250 for a RET instruction, the storage location nearest the top of return stack storage 252A in which both the RV and IXC bits are clear is used as the prediction. In this manner, CALL-RET pairs may be separated from INT-IRET pairs.
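The selection rule for RET and IRET predictions can be sketched as a search from the top of the stack downward. The names below are hypothetical; the sketch assumes every entry at or below the top index is valid, and it sets the RV bit when an entry is used, consistent with the RV bit marking entries already used as predictions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    return_pc: int
    ixc: bool         # IXC bit: set for an INT instruction, clear for CALL
    rv: bool = False  # RV bit: set once the entry has been used as a prediction


def predict_return(stack: List[Entry], top: int, is_iret: bool) -> Optional[int]:
    """Return the predicted return PC, or None if no entry qualifies.

    IRET uses the entry nearest the top with RV clear and IXC set;
    RET uses the entry nearest the top with both RV and IXC clear.
    """
    for i in range(top, -1, -1):
        entry = stack[i]
        if not entry.rv and entry.ixc == is_iret:
            entry.rv = True  # mark the entry as consumed by a prediction
            return entry.return_pc
    return None
```

With a stack holding a CALL entry, an INT entry, and another CALL entry (bottom to top), an IRET prediction skips the top CALL entry and selects the INT entry, while subsequent RET predictions consume the two CALL entries from the top down.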

It is noted that the IRET instruction is not only used in conjunction with the interrupt instruction. Additionally, the IRET instruction is used to return from asynchronous interrupts. For example, microprocessor 200 may include an interrupt pin which may be asserted by external hardware to interrupt microprocessor 200. An interrupt service routine is executed by microprocessor 200 in response to the interrupt, often ending in an IRET instruction to cause microprocessor 200 to return to the interrupted instruction sequence. Because of this alternative usage of the IRET instruction, IRET has a higher probability of being mispredicted than the RET instruction. It may be performance limiting to invalidate the entire return stack upon a mispredicted IRET instruction. The ISTART bit is set when microprocessor 200 enters an interrupt service routine. If the interrupt service routine is entered due to an asynchronous interrupt, then an entry is allocated with the program count value of the instruction interrupted as the return PC and the ISTART bit is set. If the interrupt service routine is entered due to an INT instruction, the ISTART bit is set when the INT instruction allocates an entry. If an IRET instruction is mispredicted, the entries within return stack storage 252A between the top entry and the entry nearest the top for which the ISTART bit

In one embodiment, branch prediction unit 220 conveys the misprediction signal. In another embodiment, reorder buffer 216 conveys the misprediction signal. In still another embodiment, the functional unit 212 which detects the branch misprediction conveys the misprediction signal. Upon receipt of the branch misprediction signal, comparator block 260 compares the call and return tags stored within return stack storage 252 to the branch tag conveyed upon branch tag bus 280. If a particular call or return tag is found to be subsequent to the branch tag in program order, then an associated invalidate signal is asserted upon invalidate bus 282 to return stack control unit 254. Invalidate bus 282 includes an invalidate signal for each call tag and return tag within return stack storage 252, and the signal may be asserted according to a matching comparison between the associated call or return tag and a branch tag from branch tag bus 280 or 276. It is noted that call tags and return tags stored within return stack storage 252 are conveyed to comparator block 260 upon call tag bus 284 and return tag bus 286, respectively.
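The comparison performed by the comparator block can be sketched as follows. The function name is hypothetical, and the sketch assumes that a larger tag value means later in program order, ignoring any wrap-around of the tag numbering, which the text does not describe.

```python
from typing import List, Optional, Tuple


def invalidate_signals(
    call_tags: List[Optional[int]],
    return_tags: List[Optional[int]],
    branch_tag: int,
) -> Tuple[List[bool], List[bool]]:
    """Compute one invalidate signal per stored call tag and return tag.

    A signal is asserted (True) when the stored tag is subsequent to the
    mispredicted branch tag. None stands for a tag that is not valid.
    """

    def subsequent(tag: Optional[int]) -> bool:
        # Assumption: numeric order equals program order (no wrap-around).
        return tag is not None and tag > branch_tag

    call_invalidate = [subsequent(t) for t in call_tags]
    return_invalidate = [subsequent(t) for t in return_tags]
    return call_invalidate, return_invalidate
```

For a mispredicted branch with tag 4, stored call tags 1, 4, and 6 yield an asserted invalidate signal only for the entry with call tag 6.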

Upon receipt of an asserted invalidate signal, the associated CV or RV bit within return stack storage 252 is reset. In this manner, call and return instructions subsequent to the mispredicted branch are removed from return stack storage 252. Once a call instruction is retired by reorder buffer 216, the associated call tag is invalid. Therefore, if an invalidate signal is asserted for an entry whose CRV bit is set, then the CV bit is left unmodified. Additionally, storage entries for which the call instruction has been invalidated no longer store valid information. Similar to when a return instruction has been retired, the storage locations are shifted and the pointer value adjusted to delete the invalid entries from return stack storage 252.

It is noted that call and return instructions for which the target address is mispredicted are treated as mispredicted branch instructions for purposes of this discussion. It is further noted that a mispredicted address for a return instruction may be indicative of a fake return instruction. Although conditions other than the existence of a fake return instruction may be the cause of the misprediction, it is difficult to distinguish the various conditions within return prediction unit 250. Therefore, the entire return stack is invalidated when a mispredicted return instruction is detected. Additionally, if the contents of instruction cache 204 are invalidated, then the return stack is invalidated. Instruction cache 204 may be invalidated due to a task switch, for example. The contents of return stack storage 252 may be invalid for the new task, and therefore the return stack is invalidated.

Return stack storage 252 includes a finite number of entries, and may therefore become full before any entries are deleted. When a call instruction is detected and return stack storage 252 is full of valid entries, then the entry stored at the "bottom" of the stack (i.e. the entry allocated prior to other entries within the stack) is deleted. In one embodiment, the pointer upon pointer bus 264 wraps around to the bottom storage location within return stack storage 252 and allocates that location to the newly detected call instruction. In another embodiment, all storage locations within return stack storage 252 are shifted down one location (similar to when an entry is deleted due to return instruction retirement) and the top storage location is allocated to the new call instruction. The pointer upon pointer bus 264 is unmodified for this embodiment.
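The two overflow embodiments can be contrasted with a small sketch. The function names and the list-based model (index 0 as the bottom, `top` as the pointer) are hypothetical, and the wrap-around variant assumes the storage is treated as a circular buffer once the pointer wraps.

```python
from typing import Any, List


def push_wrap(entries: List[Any], top: int, new_entry: Any) -> int:
    """First embodiment: the pointer wraps around when the storage is full.

    The location after the top (the oldest entry once full) is overwritten
    by the new call. Returns the new pointer value.
    """
    top = (top + 1) % len(entries)
    entries[top] = new_entry
    return top


def push_shift(entries: List[Any], top: int, new_entry: Any) -> int:
    """Second embodiment: when full, shift every entry down one location.

    The bottom entry is dropped, the top location receives the new call,
    and the pointer is left unmodified.
    """
    size = len(entries)
    if top == size - 1:
        for i in range(size - 1):
            entries[i] = entries[i + 1]
        entries[top] = new_entry
        return top
    entries[top + 1] = new_entry
    return top + 1
```

The shift variant keeps the newest entry at a fixed top location at the cost of moving every entry, while the wrap variant moves only the pointer.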

It is noted that certain instructions within various microprocessor architectures may generate an "exception".

|              | Document<br>ID       | U | Title                                                                                                                                                                                                 | Current<br>OR |
|--------------|----------------------|---|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------|
| 132          | US<br>63306<br>61 B1 |   | Reducing inherited logical to physical register mapping information between tasks in multithread system using register group identifier                                                              | 712/228 |
| 133          | US<br>63145<br>11 B1 |   | Mechanism for freeing registers on processors that perform dynamic out-of-order execution of instructions using renaming registers                                                                    | 712/21  |
| 134          | US<br>62984<br>31 B1 |   | Banked shadowed register file                                                                                                                                                                         | 712/28  |
| 135          | US<br>62956<br>00 B1 |   | Thread switch on blocked load or store using instruction thread field                                                                                                                                 | 712/22  |
| 136          | US<br>62601<br>50 B1 |   | Foreground and background context controller setting processor to power saving mode when all contexts are inactive                                                                                    | 713/32  |
| 137          | US<br>62567<br>75 B1 |   | Facilities for detailed software performance analysis in a multithreaded processor                                                                                                                    | 717/12  |
| 138          | US<br>62533<br>13 B1 |   | Parallel processor system for processing natural concurrencies and method therefor                                                                                                                    | 712/22  |
| 139          | US<br>62437<br>36 B1 |   | Context controller having status-based background functional task resource allocation capability and processor employing the same                                                                     | 718/10  |
| 140          | US<br>62267<br>35 B1 |   | Method and apparatus for configuring arbitrary sized data paths comprising multiple context processing elements                                                                                       | 712/18  |
| 141          | US<br>62232<br>74 B1 |   | Power-and speed-efficient data storage/transfer architecture models and design methodologies for programmable or reusable multi-media processors                                                      | 712/34  |
| 142          | US<br>62232<br>08 B1 |   | Moving data in and out of processor units using idle register/storage functional units                                                                                                                | 718/10  |
| 143          | US<br>62162<br>20 B1 |   | Multithreaded data processing method with long latency subinstructions                                                                                                                                | 712/21  |
| 144          | US<br>62125<br>44 B1 |   | Altering thread priorities in a multithreaded processor                                                                                                                                               | 718/10  |
| L45          | US<br>62125<br>42 B1 |   | Method and system for executing a program within a multiscalar processor by processing linked thread descriptors                                                                                      | 718/10  |
| 146          | US<br>62054<br>68 B1 |   | System for multitasking management employing context controller having event vector selection by priority encoding of contex events                                                                   | 718/10  |
| L <b>4</b> 7 | US<br>61700<br>51 B1 |   | Apparatus and method for program level parallelism in a VLIW processor                                                                                                                                | 712/22  |
| 48           | US<br>61611<br>66 A  |   | Instruction cache for multithreaded processor                                                                                                                                                         | 711/12  |
| .49          | US<br>61346<br>53 A  |   | RISC processor architecture with high performance context<br>switching in which one context can be loaded by a<br>co-processor while another context is being accessed by an<br>arithmetic logic unit | 712/22  |
| .50          | US<br>61345<br>78 A  |   | Data processing device and method of operation with context<br>switching                                                                                                                              | 718/10  |
| 51           | US<br>61227<br>19 A  |   | Method and apparatus for retiming in a network of multiple context processing elements                                                                                                                | 712/15  |
| .52          | US<br>61087<br>60 A  |   | Method and apparatus for position independent reconfiguration in a network of multiple context processing elements                                                                                    | 711/20  |
| .53          | US<br>61051<br>27 A  |   | Multithreaded processor for processing multiple instruction streams independently of each other by flexibly controlling throughput in each instruction stream                                         | 712/21  |

illustrate the dynamics of the present return stack structure in more detail. Beginning at arrow 400, an instruction stream including instructions INS0 through INS1 is executed. Instructions INS0 through INS9, similar to FIG. 3, represent instructions which are not branch, Call, or Ret instructions. Following INS1 is a branch instruction Jmp, with a branch tag of one. In this example, decimal numbers are used for branch tags. However, many other numbering schemes may be used for branch tags. The Jmp instruction is predicted taken, and instruction execution continues at the target of the instruction (arrow 402).

At the target of the Jmp instruction is an instruction INS2 followed by a Ret instruction. The Ret instruction is assigned the next available branch tag, which in the exemplary numbering scheme shown is a branch tag value of two. Instruction execution then begins at the predicted target of the Ret instruction (arrow 404). The remainder of the instruction stream is similar, following consecutively through arrows 406, 408, 410, 414, 416, 418, 422, 426, and 428. Branch, Call, and Ret instructions are assigned consecutive branch tags as shown in FIG. 4A. Each instruction is numbered for reference in FIGS. 4B through 4E below. It is noted that Ret instruction 409 returns to the instruction subsequent to Call instruction 405. This relationship is shown by dotted line 412. Similarly, Ret instruction 415 returns to the instruction subsequent to Call instruction 413 (line 420) and Ret instruction 417 returns to the instruction subsequent to Call instruction 411 (line 424).

Turning now to FIG. 4B, the state of return stack storage 252 prior to the execution of INS0 is shown (i.e. the state of the return stack storage at arrow 400 shown in FIG. 4A). Three valid entries are shown (reference numbers 440, 442, and 444), with return addresses A, B, and C (respectively). In this example, letters are used as exemplary return addresses for brevity. A return address is in fact a multi-bit number. In one embodiment, a return address comprises 32 bits. A call tag of zero is stored in each entry, although the entries storing return addresses A and B have their respective CRV bits set, and so the respective call tags are no longer valid for those instructions. Each entry has its CV bit set, indicating that the entries are valid. None of the RV bits are set, and so entries 440, 442, and 444 have not yet been used as return address predictions. Each entry includes a dash in the return tag field to indicate that the return tags are not valid. The pointer upon pointer bus 264 is shown pointing to the third entry of return stack storage 252. Third entry 444 is the current top of the valid entries within return stack storage 252.

Turning now to FIG. 4C, the contents of return stack storage 252 are shown at the time of fetching Jmp instruction 407 (i.e. at arrow 406 in FIG. 4A). A fourth entry has been added from the state shown in FIG. 4B (return address D, call tag of three, reference number 446). Additionally, entry 444 includes a valid return tag value of two, and the RV bit is set. The contents of return stack storage 252 may be explained by considering that the first call or return encountered is Ret instruction 403. At the time the Ret instruction is fetched, the top entry in return stack storage 252 is entry 444. Since the RV bit of entry 444 is clear, return address C is predicted for Ret instruction 403. The RV bit is then set, and the return tag of Ret instruction 403 (i.e. the value two) is stored into the return tag field of entry 444. The next Call or Ret instruction encountered is Call instruction 405. Fourth entry 446 is allocated at the fetch of Call instruction 405, and the call tag is set to the call tag of Call instruction 405 (i.e. the value of three). The pointer is now shown pointing to fourth entry 446.

is set are invalidated. In this manner, only the portion of return stack storage 252A which is associated with the interrupt service routine may be invalidated.

Turning now to FIG. 3, an exemplary instruction sequence is shown to further highlight the operation of return prediction unit 250 in predicting addresses. In the following discussion, for brevity, it is assumed that no fake return instructions are used. It is noted that instructions labeled INS0 through INS8 represent instructions which are not branch, Call, or Ret instructions. Beginning at arrow 300, an instruction stream of contiguous instructions labeled INS0 through INS2 are executed, and then a call instruction Call A is encountered. Instruction execution transfers to an address specified by Call A (arrow 302), and a set of contiguous instructions labeled INS3 through INS4 are executed. A second call instruction Call B is encountered, causing instruction execution to transfer to yet another address specified by the Call B instruction (arrow 304). A third set of contiguous instructions labeled INS5 through INS6 are executed, and a return instruction Ret B is encountered.

Ret B is the first return instruction encountered, and so is defined to return to the instruction in memory immediately following the most recently executed call instruction (Call B, arrow 306). In other words, Ret B fetches an instruction stored in memory locations contiguous to the memory locations storing the Call B instruction. The instruction Ret A is immediately following Call B in this example, and is the second return instruction encountered in the example. Therefore, Ret A is defined to return to the second most recently executed call instruction (Call A, arrow 308). Immediately following the Call A instruction in memory is the instruction INS7, and instructions continue to be executed from that point forward. Return instructions are paired with call instructions in a last-in, first-out (LIFO) manner. The last call instruction to be executed is the first to be paired with a return instruction, etc. It is noted that return stack storage 252 is well suited to the LIFO manner in which call and return instructions are associated.

If the exemplary instruction sequence shown in FIG. 3 is executed by a superscalar microprocessor, then performance may be gained by predicting the return address of the Ret B and Ret A instructions. During a clock cycle, Call A is fetched and the return address is placed within return prediction unit 250. During a later clock cycle, Call B is fetched and the return address is placed within return prediction unit 250 as well. The Call B entry is at the top of the return stack storage 252 (as defined by the pointer upon pointer bus 264), and the Call A entry is second from the top. When Ret B is fetched, the entry nearest the top of return stack storage 252 which has not been used as a return address prediction is used as the prediction address (as noted above). In this example, the entry chosen is the top entry, since no return address predictions have yet been made. Therefore, the return address for Ret B is predicted to be the address following Call B.

During a subsequent clock cycle, prior to the Ret B instruction retiring, the Ret A instruction is fetched. The return address prediction is again made according to the top entry which has not been used as a prediction. Although the Call B entry is still at the top of the stack (the Ret B instruction has yet to be retired), that entry has been used as a return address prediction. Therefore, the second from the top entry is used. The address of the instruction immediately following Call A is used as the return address prediction for Ret A. In both cases, the correct return address is predicted.
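The selection rule used for Ret B and Ret A (take the entry nearest the top that has not yet served as a prediction) can be sketched in a few lines. This is an illustrative model only: the function name `predict_return` and the `used` flag, which stands in for the per-entry "already used as a prediction" indication, are assumptions.

```python
def predict_return(entries):
    """Return the address from the entry nearest the top that has not been
    used as a prediction, and mark that entry as used.

    `entries` is ordered bottom-to-top. Entries stay in place after being
    used, mirroring the fact that the return has not yet retired.
    """
    for entry in reversed(entries):
        if not entry["used"]:
            entry["used"] = True
            return entry["addr"]
    return None  # every entry has already been used


stack = []
stack.append({"addr": "after Call A", "used": False})  # Call A fetched
stack.append({"addr": "after Call B", "used": False})  # Call B fetched

ret_b = predict_return(stack)  # Ret B fetched
ret_a = predict_return(stack)  # Ret A fetched before Ret B retires
```

Because the Call B entry is only marked as used rather than removed, the second lookup skips it and falls through to the Call A entry.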

Turning now to FIG. 4A, a second exemplary instruction sequence is used to

|     | Document ID  | U | Title | Current OR |
|-----|--------------|---|-------|------------|
| 154 | US 6105051 A |   | Apparatus and method to guarantee forward progress in execution of threads in a multithreaded processor | 718/103 |
| 155 | US 6101599 A |   | System for context switching between processing elements in a pipeline of processing elements | 712/228 |
| 156 | US 6092175 A |   | Shared register storage mechanisms for multithreaded computer systems with out-of-order execution | 712/23 |
| 157 | US 6088788 A |   | Background completion of instruction and associated fetch request in a multithread processor | 712/205 |
| 158 | US 6079008 A |   | Multiple thread multiple data predictive coded parallel processing system and method | 712/11 |
| 159 | US 6076157 A |   | Method and apparatus to force a thread switch in a multithreaded processor | 712/228 |
| 160 | US 6073159 A |   | Thread properties attribute vector based thread selection in multithreading processor | 718/103 |
| 161 | US 6061710 A |   | Multithreaded processor incorporating a thread latch register for interrupt service new pending threads | 718/107 |
| 162 | US 6052708 A |   | Performance monitoring of thread switch events in a multithreaded processor | 718/108 |
| 163 | US 6018759 A |   | Thread switch tuning tool for optimal performance in a computer processor | 718/108 |
| 164 | US 5949994 A |   | Dedicated context-cycling computer with timed context | 712/228 |
| 165 | US 5933627 A |   | Thread switch on blocked load or store using instruction thread field | 712/228 |
| 166 | US 5924120 A |   | Method and apparatus for maximizing utilization of an internal processor bus in the context of external transactions running at speeds fractionally greater than internal transaction times | 711/141 |
| 167 | US 5915123 A |   | Method and apparatus for controlling configuration memory contexts of processing elements in a network of multiple context processing elements | 712/16 |
| 168 | US 5913925 A |   | Method and system for constructing a program including out-of-order threads and processor and method for executing threads out-of-order | 712/206 |
| 169 | US 5907702 A |   | Method and apparatus for decreasing thread switch latency in a multithread processor | 718/108 |
| 170 | US 5898864 A |   | Method and system for executing a context-altering instruction without performing a context-synchronization operation within high-performance processors | 712/228 |
| 171 | US 5887166 A |   | Method and system for constructing a program including a navigation instruction | 718/102 |
| 172 | US 5881277 A |   | Pipelined microprocessor with branch misprediction cache circuits, systems and methods | 712/239 |
| 173 | US 5872985 A |   | Switching multi-context processor and method overcoming pipeline vacancies | 710/1 |
| 174 | US 5825770 A |   | Multiple algorithm processing on a plurality of digital signal streams via context switching | 370/378 |
| 175 | US 5742180 A |   | Dynamically programmable gate array with multiple contexts | 326/40 |

decode units is further coupled to instruction alignment unit 508, and a set 512 of reservation station/functional units is coupled to a load/store unit 514 and to a reorder buffer 516. A register file unit 518 is finally shown coupled to reorder buffer 516, and a data cache 522 is shown coupled to load/store unit 514.

Processor 500 limits the addressing mechanism used in the x86 to achieve both a regular, simple form of addressing as well as high clock frequency execution. It also targets 32-bit O/S and applications. Specifically, 32-bit flat addressing is employed where all the segment registers are mapped to all 4GB of physical memory, the starting address being 0000-0000 hex and their limit address being FFFF hex. The setting of this condition will be detected within processor 500 as one of the conditions to allow the collection of accelerated data paths and instructions to be enabled. The absence of this condition of 32-bit flat addressing will cause a serialization condition on instruction issue and a trapping to MROM

Another method to ensure that a relatively high clock frequency may be accommodated is to limit the number of memory address calculation schemes to those that are simple to decode and can be decoded within a few bytes. We are also interested in supporting addressing that fits into our other goals, i.e., regular instruction decoding.

As a result, the x86 instruction types that are supported for load/store operations are:

| Instruction | Addressing form             |
|-------------|-----------------------------|
| push        | [implied ESP - 4]           |
| pop         | [implied ESP + 4]           |
| call        | [implied ESP + 8]           |
| ret         | [implied ESP - 8]           |
| load        | [base + 8-bit displacement] |
| store       | [base + 8-bit displacement] |
| oper.       | [EBP + 8-bit displacement]  |
| oper.       | [EAX + 8-bit displacement]  |

The block diagram of FIG. 6 shows the pipeline for calculating addressing within processor 500. It is noted that base + 8/32 bit displacement takes 1 cycle, where using an index register takes 1 more cycle of delay in calculating the address. More complicated addressing than these requires invoking an MROM routine to execute.

An exemplary listing of the instruction sub-set supported by processor 500 as fast path instructions is provided below. All other x86 instructions will be executed as micro-ROM sequences of fast path instructions or extensions to fast path

The standard x86 instruction set is very limited in the number of registers it provides. Most RISC processors have 32 or greater general purpose registers, and many important variables can be held during and across procedures or processes during normal execution of routines. Because there are so few registers in the x86 architecture and most are not general purpose, a large percentage of operations are moves to and from memory. RISC architectures also incorporate 3 operand addressing to prevent moves from occurring of register values that are desired to be saved instead of overwritten.

The x86 instruction set uses a set of registers that can trace its history back to the 8080. Consequently there are few registers, many side effects, and sub-registers within registers. This is because when moving to 16-bit, or 32-bit operands, mode bits were added and the lengths of the registers were extended instead of expanding the size of the register file. Modern compiler technology can make use of

Turning next to FIG. 4D, the state of return stack storage 252 is shown after fetching Call instruction 413 (i.e. at arrow 416 shown in FIG. 4A). Two additional entries (reference numbers 448 and 450) have been added with respect to the state shown in FIG. 4C, and entry 446 has been updated. Entry 446 is updated at the fetching of Ret instruction 409, which includes a return tag value of five. Entry 446 is the top of return stack storage 252 at the time Ret instruction 409 is fetched, and the RV bit of entry 446 is clear. Therefore, return address D is used as the return address prediction and the RV bit of entry 446 is set. Call instructions 411 and 413, respectively, cause the allocation of entries 448 and 450. Therefore, entry 448 includes a call tag value of six and entry 450 includes a call tag value of seven.

Turning now to FIG. 4E, the state of return stack storage 252 is shown at the end of the exemplary instruction stream shown in FIG. 4A (i.e. at arrow 418 as shown in FIG. 4A). Yet another entry 452 has been added, and entries 448, 450, and 442 have been updated. Entry 450 is updated at the fetch of Ret instruction 415, because entry 450 is at the top of stack storage 252 at the time Ret instruction 415 is fetched. As shown in FIG. 4E, return address F is used as a prediction for Ret instruction 415 and the return tag associated with Ret instruction 415 is stored into entry 450. Similarly, Ret instruction 417 causes the update of entry 448. Entry 448 is not at the top of return stack storage 252, but the entry which is at the top (entry 450) has already been used as a return address prediction for Ret instruction 415. Therefore, entry 448 is used as the return address prediction for Ret instruction 417.

Additionally, Ret instruction 419 is fetched and a return address prediction is formulated. At the time Ret instruction 419 is fetched, entries 450, 448, 446, and 444 have previously been used as return address predictions. Therefore, entry 442 is the entry of return stack storage 252 nearest the top of the stack which has not yet been used as a return address prediction. Return address B is predicted, and entry 442 is updated with the return tag associated with Ret instruction 419 (i.e. a return tag value of ten). Call instruction 421 is then decoded and causes entry 452 to be added to the top of return stack storage 252.

The exemplary code sequence shown in FIG. 4A may be used to illustrate the recovery of return prediction unit 250 from mispredicted branches. As an example, consider Jmp instruction 407 being mispredicted and the misprediction being detected at arrow 416, such that FIG. 4D shows the state of return stack storage 252. Jmp instruction 407 includes a branch tag of four, and therefore any instructions having call and return tags subsequent to a branch tag value of four within return stack storage 252 are invalidated. It is noted that call tags which are associated with retired call instructions are not invalidated. As shown in FIG. 4D, the return tag of entry 446 and the call tags of entries 450 and 448 are invalidated. As a second example, consider Jmp instruction 407 not being determined to be mispredicted until return stack storage 252 achieves the state shown in FIG. 4E. In this example, the return tags of entries 450, 448, 446, and 442 and the call tags of entries 452, 450, and 448 are invalidated. It is noted that call instructions whose target addresses are incorrectly predicted operate similarly with respect to return stack storage 252.
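The two recovery examples can be checked with a short model of the tag comparison. This is a sketch under stated assumptions rather than the described hardware: the helper names `entry` and `recover` are hypothetical, tags are plain integers so that "subsequent to" becomes "numerically greater than" (a real tag scheme may wrap), and the call tag of eleven for entry 452 is assumed to be the next consecutive tag after Ret instruction 419.

```python
def entry(call_tag, ret_tag=None, call_retired=False):
    # ret_tag=None models the dash shown for an invalid return tag.
    return {"call_tag": call_tag, "ret_tag": ret_tag, "call_retired": call_retired}


def recover(entries, mispredicted_tag):
    """Invalidate call and return tags younger than the mispredicted branch.

    Returns (entries whose return tag was invalidated,
             entries whose call tag was invalidated).
    """
    ret_invalidated, call_invalidated = [], []
    for ref, e in entries.items():
        if e["ret_tag"] is not None and e["ret_tag"] > mispredicted_tag:
            e["ret_tag"] = None
            ret_invalidated.append(ref)
        # Call tags belonging to retired calls are left alone.
        if (not e["call_retired"] and e["call_tag"] is not None
                and e["call_tag"] > mispredicted_tag):
            e["call_tag"] = None
            call_invalidated.append(ref)
    return sorted(ret_invalidated), sorted(call_invalidated)


# State of FIG. 4D, keyed by entry reference number.
fig_4d = {
    440: entry(0, call_retired=True),
    442: entry(0, call_retired=True),
    444: entry(0, ret_tag=2),
    446: entry(3, ret_tag=5),
    448: entry(6),
    450: entry(7),
}
# State of FIG. 4E (call tag 11 for entry 452 is an assumption).
fig_4e = {
    440: entry(0, call_retired=True),
    442: entry(0, ret_tag=10, call_retired=True),
    444: entry(0, ret_tag=2),
    446: entry(3, ret_tag=5),
    448: entry(6, ret_tag=9),
    450: entry(7, ret_tag=8),
    452: entry(11),
}

result_4d = recover(fig_4d, mispredicted_tag=4)
result_4e = recover(fig_4e, mispredicted_tag=4)
```

For the FIG. 4D state the model invalidates the return tag of entry 446 and the call tags of entries 448 and 450; for the FIG. 4E state it invalidates the return tags of entries 442, 446, 448, and 450 and the call tags of entries 448, 450, and 452, in agreement with the text.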

Turning next to FIGS. 5-66, details regarding various aspects of another embodiment of a superscalar microprocessor are next considered. FIG. 5 is a block diagram of a processor 500 including an instruction cache 502 coupled to a prefetch/predecode unit 504, to a branch prediction unit 506, and to an instruction alignment unit 508. A set 510 of

|     | Document ID  | U | Title | Current OR |
|-----|--------------|---|-------|------------|
| 176 | US 5652581 A |   | Distributed coding and prediction by use of contexts | 341/51 |
| 177 | US 5574939 A |   | Multiprocessor coupling system with integrated compile and run time scheduling for parallelism | 712/24 |
| 178 | US 5550993 A |   | Data processor with sets of two registers where both registers receive identical information and when context changes in one register the other register remains unchanged | 712/229 |
| 179 | US 5550540 A |   | Distributed coding and prediction by use of contexts | 341/51 |
| 180 | US 5404469 A |   | Multi-threaded microprocessor architecture utilizing static interleaving | 712/215 |
| 181 | US 5361337 A |   | Method and apparatus for rapidly switching processes in a computer system | 712/228 |
| 182 | US 5357617 A |   | Method and apparatus for substantially concurrent multiple instruction thread processing by a single pipeline processor | 712/245 |
| 183 | US 5349687 A |   | Speech recognition system having first and second registers enabling both to concurrently receive identical information in one context and disabling one to retain the information in a next context | 704/231 |
| 184 | US 5319792 A |   | Modem having first and second registers enabling both to concurrently receive identical information in one context and disabling one to retain the information in a next context | 712/228 |
| 185 | US 5319789 A |   | Electromechanical apparatus having first and second registers enabling both to concurrently receive identical information in one context and disabling one to retain the information in a next context | 712/228 |
| 186 | US 5313648 A |   | Signal processing apparatus having first and second registers enabling both to concurrently receive identical information in one context and disabling one to retain the information in a next context | 712/228 |
| 187 | US 5179734 A |   | Threaded interpretive data processor | 712/1 |
| 188 | US 5142677 A |   | Context switching devices, systems and methods | 718/108 |
| 189 | US 4847755 A |   | Parallel processing method and apparatus for increasing processing throughout by parallel processing low level instructions having natural concurrencies | 712/203 |

continued

prefix/pop; prefix/push; prefix/logical operations reg/reg, reg/mem; prefix/arithmetic operations reg/reg, reg/mem; prefix/mov reg/reg, reg/mem; 16-bit operations; jump unconditional

When executing 32-bit code under flat addressing, these instructions almost always fall within 1-8 bytes in length, which is in the same rough range of the aligned, accelerated fast path instructions.

Accelerated instructions are defined as fast-path instructions between 1 and 8 bytes in length. It is noted that it is possible that the start/end positions predecoded reflect multiple x86 instructions, for instance 2 or 3 pushes that are predecoded in a row may be treated as one accelerated instruction that consumes 3 bytes.

When a cache line is fetched from the instruction cache, it moves into an instruction alignment unit which looks for start bytes within narrow ranges. The instruction alignment unit uses the positions of the start bytes of the instructions to dispatch the instructions to four issue positions. Instructions are dispatched such that each issue position accepts the first valid start byte within its range along with subsequent bytes.

A multiplexer in each decoder looks for the end byte associated with each start byte, where an end byte can be no more than seven bytes away from a start byte. The mechanism to scan for a constant value in an instruction over four bytes in length may be given an extra pipeline stage due to the amount of time potentially required.

Note that instructions included in the subset of accelerated instructions, and which are over four bytes in length, always have a constant as the last 1/2/4 bytes. This constant is usually not needed until the instruction is issued to a functional unit, and therefore the determination of the constant value can be delayed in the pipeline. The exception is an instruction requiring an eight-bit displacement for an address calculation. The eight-bit displacement for stack-relative operations is always the third byte after the start byte, so this field will always be located within the same decoder as the rest of the instruction.

The assumption in the processor 500 alignment hardware The following set of instructions probably comprise 90% 55 (if included), and the fourth byte being a 16-bit data prefix. the third byte being a sib byte specifying a memory address bytes. The opcode is almost always the first two bytes, with and O/S code, the average instruction length is about three instructions are dispatched. Typically, in 32-bit application 50 allocates a second line in the buffer as the remaining reorder buffer. If this occurs, the four issue reorder buffer entry positions contained in each line of the four issue instructions to issue than can be accommodated by the four It is possible that a given cache line can have more

potentially idle. more than compensates for having some decoder positions 65 tions are still issued in parallel and at a high clock frequency instruction cache. The fact that these more compact instrucresults of instructions contained in a few lines of the lines are allocated in the four issue reorder buffer for the occurs (i.e., lots of one and two byte instructions), several 60 16-byte instruction cache lines. If very dense decoding ranges should accommodate most instructions found within dedicated issue positions and decoders assigned limited byte is that if the average instruction length is three, then four

> the real operation destinations are in memory. relegated to temporary registers for a few clock cycles while 5 compiling to the x86. The actual x86 registers are often have a much larger percentage of loads and stores when loads and stores. The effect of these same compilers is to large register sets and have a much smaller percentage of

> of variables to and from memory. registers, they tend to act as holding positions for the passing and extended with the 386. Because there are so few real instruction. The final 4 registers were added with the 8086 8, 16, or 32-bits depending on the mode of the processor or registers, EAX, EDX, ECX, and EBX, have operand sizes of 10 registers. and few are general purpose. The first four One notes from this organization that there are only 8 FIG. 7 shows a programmer's view of the x86 register file.

on the stack or in a fixed location. and all important program variables must be held in memory the register file. This is because there are too few registers instructions in parallel, it is not enough to simply multi-port 32-bit operands. If one is trying to execute multiple x86 instructions, one must be able to efficiently handle 8, 16, and The important thing to note is that when executing x86

straightforward, since they are always at fixed boundaries. a large number of instructions and their opcodes is relatively very wide issue processors. This is possible because finding and also with a small number of pipeline stages even for ustratal boundaries to achieve very high clock frequencies RISC designs employ regular instruction decoding along

length and addressing/data types of the original opcode. as well as prefix bytes and SIB bytes that can effect the processor where there are variable byte instruction formats, As stated previously, this is much more difficult in an x86  $\,^{30}$ 

tions that each instruction cache line can assume in parallel. This may be compensated for by adding many issue posiof bytes that a particular issue position can use is limited. simple instructions to fixed issue positions, where the range Processor 500 employs hardware to detect and send

tions under the conditions of 32-bit flat addressing. executed at high frequency to a sub-set of the x86 instrucissue, and limited pipeline depth by limiting the instructions be achieved. Processor 500 achieves high frequency, wide a RISC processor, allowing equivalent clock frequencies to common instructions is not significantly greater than that of position, the net amount of hardware required to decode Once the instructions are aligned to a particular issue

allow it to writeback to the data cache when the line is from. The reorder buffer then can either cancel this store or buffer, from which point it can be speculatively forwarded is held in speculative state in front of the data cache in a store corresponding entry in the reorder buffer. If a store, the store The results of executing instructions are returned to the

of the dynamically executed code for 32-bit applications:

so dumi load effective address call/return qsnd mam/gar gar/gar anotimago leaigol operations regireg regimem push arithmetic operations reg/mem reg/reg logical யാமடுப் துப்துப் சென் 8/32-bit operations
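The relationship between instruction density and reorder buffer lines described in this passage can be illustrated with a small sketch. It assumes only what the text states: four issue positions per reorder buffer line, and a further line allocated whenever a cache line holds more instructions than fit. The function names are hypothetical, and the model deliberately ignores the per-position byte-range limits, whose exact values the text does not give.

```python
from typing import List

ISSUE_POSITIONS = 4  # four issue positions per reorder buffer line (from the text)


def group_into_rob_lines(start_bytes: List[int]) -> List[List[int]]:
    """Split the instructions of one cache line into groups of at most four,
    in program order. Each group occupies one reorder buffer line.
    Simplification: byte-range limits of each issue position are not modeled."""
    ordered = sorted(start_bytes)
    return [ordered[i:i + ISSUE_POSITIONS]
            for i in range(0, len(ordered), ISSUE_POSITIONS)]


def idle_positions(groups: List[List[int]]) -> int:
    """Count issue positions left empty across all allocated lines."""
    return sum(ISSUE_POSITIONS - len(group) for group in groups)


# Four instructions of mixed length fit in a single reorder buffer line.
typical = group_into_rob_lines([0, 3, 7, 12])
# A fifth instruction forces a second line, leaving three positions idle.
five = group_into_rob_lines([0, 2, 5, 8, 11])
# Very dense decoding: sixteen one-byte instructions need four lines.
dense = group_into_rob_lines(list(range(16)))
```

Under this simplified model, idle positions appear only in the last line allocated for a cache line, which is the cost the text argues is outweighed by parallel issue at a high clock frequency.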

|    | Document<br>ID               | U   | Title                                                                                                                                         | Current<br>OR |
|----|------------------------------|-----|-----------------------------------------------------------------------------------------------------------------------------------------------|---------------|
| 1  | US<br>20040<br>07390<br>5 A1 |     | Method and apparatus to quiesce a portion of a simultaneous multithreaded central processing unit                                             | 718/101       |
| 2  | US<br>20040<br>07377<br>8 A1 |     | Parallel processor architecture                                                                                                               | 712/220       |
| 3  | US<br>20040<br>05488<br>0 A1 |     | Microengine for parallel processor architecture                                                                                               | 712/245       |
| 4  | US<br>20040<br>03475<br>9 A1 |     | Multi-threaded pipeline with context issue rules                                                                                              | 712/1         |
| 5  | US<br>20030<br>14515<br>9 A1 |     | SRAM controller for parallel processor architecture                                                                                           | 711/104       |
| 6  | US<br>20030<br>10594<br>4 A1 |     | Method and apparatus to quiesce a portion of a simultaneous multithreaded central processing unit                                             | 712/220       |
| 7  | US<br>20030<br>09754<br>8 A1 |     | Context execution in pipelined computer processor                                                                                             | 712/228       |
| 8  | US<br>20030<br>03722<br>8 A1 |     | System and method for instruction level multithreading scheduling in a embedded processor                                                     | 712/245       |
| 9  | US<br>20030<br>00526<br>2 A1 |     | Mechanism for providing high instruction fetch bandwidth in a multi-threaded processor                                                        | 712/207       |
| 10 | US<br>20020<br>12922<br>7 A1 |     | Processor having priority changing function according to threads                                                                              | 712/228       |
| 11 | US<br>20020<br>08337<br>3 A1 |     | Journaling for parallel hardware threads in multithreaded processor                                                                           | 714/38        |
| 12 | US<br>20020<br>05603<br>7 A1 |     | Method and apparatus for providing large register address space while maximizing cycletime performance for a multi-threaded register file set | 712/215       |
| 13 | US<br>20020<br>03841<br>6 A1 |     | System and method for reading and writing a thread state in a multithreaded central processing unit                                           | 712/228       |
| 14 | US<br>20020<br>02170<br>7 A1 |     | Method and apparatus for non-speculative pre-fetch operation in data packet processing                                                        | 370/412       |
| 15 | US<br>20020<br>01848<br>6 A1 |     | Context selection and activation mechanism for activating one of a group of inactive contexts in a processor core for servicing interrupts    | 370/463       |
| 16 | US<br>20020<br>00266<br>7 A1 |     | System and method for instruction level multithreading in an embedded processor using zero-time context switching                             | 712/228       |
| 17 | US<br>20010<br>05205<br>3 A1 |     | Stream processing unit for a multi-streaming processor                                                                                        | 711/138       |

Overview of the Processor 500 Instruction Cache (Icache)

This section describes the instruction cache organization, fetching mechanism, and pre-decode information. The Processor 500 instruction cache has basic features including the ICSTORE, ICTAGV, ICNXTBLK, ICCNTL, ICALIGN, ICFPC, and ICPRED. Highlights are: the pre-decode bits per byte of instructions are 3 bits, the branch prediction increases to 2 targets, 2 different types of branch prediction techniques (bimodal and global) are implemented, the X86 instructions align to 4 fixed length RISC-type instructions, and the pre-decode logic eliminates many serialization conditions. Processor 500 executes the X86 instructions directly with a few instructions requiring two Rops, the BYTEQ is configured for fast scanning of instructions, and instructions are aligned to 4 decode units. The pre-decode data is separate in a block called ICPDAT, instead of inside the ICSTORE. The pre-fetch buffers are added to the ICSTORE to write instructions directly into the array, and the prefixes are not modified. All branches are detected during pre-decoding. Unconditional branches are taken during pre-decoding and aligning of instructions to the decode units. A return stack is implemented for CALL/RETURN instructions. Way prediction is implemented to read the current block and fetch the next block because the tag comparison and branch prediction do not resolve until the second cycle. The scanning for 4 instructions is done from ICPDAT before being selected by tag comparison. Since the pre-decode data does not include the information for the 2-Rop instructions, the instructions must be partially decoded for the 2-rop during prioritizing and aligning of instructions to decode units. The early decoding includes decoding for unconditional branches, operand addresses, flags, displacement and immediate fields of the instruction. The CMASTER takes care of the replacement algorithm for the Icache and sends the way associative along with the data to the pre-fetch buffer. This section includes signal lists, timings and implementation issues for the Icache and all sub-blocks.

The Icache size is 32K bytes with 8-way set associativity. The Icache is linearly addressed. The number of pipeline stages is 9. Icache will have more than one clock cycle to read and align the instructions to the decode units. The row decoding of index address is calculated in the first half of ICLK, the data, tag, pre-decode, and predicting information are read in by the end of ICLK. In the next cycle, the data are selected by the TAGHITs and latched. The pre-decode data are scanned to generate the controls to the multiplexers for aligning and sending the instructions to the decode units and MROM units. A part of the scanning logic is done in parallel with the tag comparison. The scanning and alignment of instructions takes two clock cycles. The decode units can start decoding in the second half of the third clock. The Icache includes a way-prediction which can be done in a single clock using the ICNXTBLK target. The branch prediction includes bimodal and global branch prediction which takes two clock cycles. The timing from fetching, scanning, aligning, decoding, and muxing of instructions to decode units is shown in FIG. 8.

Throughout this documentation, a discussion of the layout organization is included in each section. The array is organized into many sets, and each set has its own decoder. The decoder is in the center of the set.

Signal list

SRBB(31:0) - I/O from SRB indicates the special register address for the array or data transferring to/from the SRB.

SRB\_VAL - Input from SRB indicates a special register instruction is on the SRBB.

ICTAR\_VAL - Output to SRB indicates completion of the special register instruction, for read the data is on the SRBB.

IRESET - Global signal used to reset ICACHE block. Clears all state machines to Idle/Reset.

IDECJAMIC - Global signal from FIROB. Used to indicate that an interrupt or trap is being taken. Effect on Icache is to clear all pre-fetch or access in progress, and set all state machines to Idle/Reset.

EXCEPTION - Global input from FIROB indicates that an interrupt or trap is being taken including re-synchronization. Effect on Idecode and FUs is to clear all instructions in progress.

REQTRAP - Global input from FIROB, one cycle after EXCEPTION, indicates that the trap is initiated with new entry point or new PC is driven.

INVBHREG - Input from FIROB to invalidate the branch holding register. The branch mis-prediction is speculative, an early branch can be mis-predicted at a later time.

CS32X16 - Input from LSSEC indicates operand and address size from the D bit of the segment descriptor of the code segment register. If set, 32-bit, if clear, 16-bit.

SUPERV - Input from LSSEC indicates the supervisor mode or user mode of the current accessed instruction.

TR12DIC - Input from SRB indicates that all un-cached instructions must be fetched from the external memory.

SRBINVILV - Input from SRB to invalidate the Icache by clearing all valid bits.

INSTRDY - Input from BIU indicates the valid external fetched instruction is on the INSB(63:0) bus.

INSTFLT - Input from BIU indicates the valid but faulted external fetched instruction is on the INSB(63:0) bus.

INSB(63:0) - Input from external buses for fetched instruction to the Icache.

L2\_IC\_ALIAS - Input from CMASTER indicates the instruction is in the Icache with different mapping. The CMASTER provides the way associative and new supervisor bit. The LV will be set in this case.

PFREPLCOL(2:0) - Input from CMASTER indicates the way associative for writing of the ICTAGV.

UPDFPC - Input from FIROB indicates that a new Fetch PC has been detected. This signal accompanies the FPC for the Icache to begin accessing the cache arrays.

FPC(31:0) - Input from FIROB as the new PC for branch correction path.

BPC(11:0) - Input from FIROB indicates the PC index and byte-pointer of the branch instruction which has been mis-predicted for updating the ICNXTBLK. This index must be compared to the array index for exact recovery of the global shift register.

BRNMISP - Input from the Branch execution of the FU indicates a branch mis-prediction. The Icache changes its state machine to access a new PC and clears all pending instructions.

BRNTAKEN - Input from FIROB indicates the status of the mis-prediction. This signal must be gated with UPDFPC.

BRNTAG(3:0) - Input from FIROB indicates the instruction byte for updating the branch prediction in the ICNXTBLK.

FPCTYP - Input from FIROB indicates the type of address that is being passed to the Icache.

HLDISP(1:0) - Output to Idecode indicates all instructions of the first (bit 0) and/or the second (bit 1) 8-byte of the current line have been dispatched to decode units.

REFRESH2 - Input from Idecode indicates current line of instructions will be refreshed and not accept new instructions from Icache.

MROMEND - Input from MENG indicates completion of the MROM.

D0USEFL(4:0)
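As a worked illustration of the cache geometry given in the overview (32K bytes, 8-way set associative, linearly addressed), the following sketch splits a 32-bit linear address into tag, set index, and byte offset. The 16-byte line size is taken from the 16-byte instruction cache lines mentioned elsewhere in the description; the field names and the helper function are hypothetical, and the actual tag storage in ICTAGV may differ.

```python
CACHE_BYTES = 32 * 1024  # Icache size, from the text
WAYS = 8                 # 8-way set associative, from the text
LINE_BYTES = 16          # assumed from the 16-byte cache lines described elsewhere

SETS = CACHE_BYTES // (WAYS * LINE_BYTES)   # number of sets
OFFSET_BITS = LINE_BYTES.bit_length() - 1   # bits selecting a byte in the line
INDEX_BITS = SETS.bit_length() - 1          # bits selecting a set


def split_linear_address(addr: int):
    """Return (tag, set_index, byte_offset) for a 32-bit linear address."""
    byte_offset = addr & (LINE_BYTES - 1)
    set_index = (addr >> OFFSET_BITS) & (SETS - 1)
    tag = (addr & 0xFFFFFFFF) >> (OFFSET_BITS + INDEX_BITS)
    return tag, set_index, byte_offset


tag, set_index, byte_offset = split_linear_address(0x12345678)
```

With these assumptions the index and byte offset together occupy 12 bits, which would be consistent with the width of BPC(11:0), described as carrying the PC index and byte-pointer.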

|    | Document<br>ID       | U | Title                                                                                                                           | Current<br>OR |
|----|----------------------|---|---------------------------------------------------------------------------------------------------------------------------------|---------------|
| 18 | US<br>66944<br>25 B1 |   | Selective flush of shared and other pipeline stages in a multithread processor                                                  | 712/216       |
| 19 | US<br>66751<br>92 B2 |   | Temporary halting of thread execution until monitoring of armed events to memory location identified in working registers       | 718/107       |
| 20 | US<br>66718<br>27 B2 |   | Journaling for parallel hardware threads in multithreaded processor                                                             | 714/38        |
| 21 | US<br>66683<br>17 B1 |   | Microengine for parallel processor architecture                                                                                 | 712/245       |
| 22 | US<br>66338<br>65 B1 |   | Multithreaded address resolution system                                                                                         | 707/3         |
| 23 | US<br>66067<br>04 B1 |   | Parallel multithreaded processor with plural microengines executing multiple threads each microengine having loadable microcode | 712/248       |
| 24 | US<br>64937<br>41 B1 |   | Method and apparatus to quiesce a portion of a simultaneous multithreaded central processing unit                               | 718/107       |
| 25 | US<br>64704<br>43 B1 |   | Pipelined multi-thread processor selecting thread instruction in inter-stage buffer based on count information                  | 712/205       |
| 26 | US<br>64703<br>76 B1 |   | Processor capable of efficiently executing many asynchronous event tasks                                                        | 718/108       |
| 27 | US<br>64271<br>96 B1 |   | SRAM controller for parallel processor architecture including address and command queue and arbiter                             | 711/158       |
| 28 | US<br>63634<br>75 B1 |   | Apparatus and method for program level parallelism in a VLIW processor                                                          | 712/206       |
| 29 | US<br>62956<br>00 B1 |   | Thread switch on blocked load or store using instruction thread field                                                           | 712/228       |
| 30 | US<br>61700<br>51 B1 |   | Apparatus and method for program level parallelism in a VLIW processor                                                          | 712/225       |
| 31 | US<br>61190<br>75 A  |   | Method for estimating statistics of properties of interactions processed by a processor pipeline                                | 702/186       |
| 32 | US<br>60947<br>15 A  |   | SIMD/MIMD processing synchronization                                                                                            | 712/20        |
| 33 | US<br>60731<br>59 A  |   | Thread properties attribute vector based thread selection in multithreading processor                                           | 718/103       |
| 34 | US<br>59665<br>28 A  |   | SIMD/MIMD array processor with vector processing                                                                                | 712/222       |
| 35 | US<br>59637<br>46 A  |   | Fully distributed processing memory element                                                                                     | 712/20        |
| 36 | US<br>59637<br>45 A  |   | APAP I/O programmable router                                                                                                    | 712/13        |
| 37 | US<br>59499<br>94 A  |   | Dedicated context-cycling computer with timed context                                                                           | 712/228       |
| 38 | US<br>58812<br>77 A  |   | Pipelined microprocessor with branch misprediction cache circuits, systems and methods                                          | 712/239       |
| 39 | US<br>58782<br>41 A  |   | Partitioning of processing elements in a SIMD/MIMD array processor                                                              | 712/203       |
| 40 | US<br>58706<br>19 A  |   | Array processor with asynchronous availability of a next SIMD instruction                                                       | 712/20        |

xx1 CF-carry flag,

x1x OF-overflow flag,

1xx SF-sign, ZF-zero, PF-parity, and AF-auxiliary carry

D1USEFL(4:0)
D1WRFL(4:0) - Output to FIROB indicates the type of flag uses/writes for this instruction of decode unit 1.

D2USEFL(4:0)
D2WRFL(4:0) - Output to FIROB indicates the type of flag uses/writes for this instruction of decode unit 2.

D3USEFL(4:0)
D3WRFL(4:0) - Output to FIROB indicates the type of flag uses/writes for this instruction of decode unit 3.

RD0PTR1(5:0) - Indicates the register address for operand 1 of decode unit 0. The MROM is responsible to send bit 5:3 for the MROM register.

RD1PTR1(5:0) - Indicates the register address for operand 1 of decode unit 1. The MROM is responsible to send bit 5:3 for the MROM register.

RD2PTR1(5:0) - Indicates the register address for operand 1 of decode unit 2. The MROM is responsible to send bit 5:3 for the MROM register.

RD3PTR1(5:0) - Indicates the register address for operand 1 of decode unit 3. The MROM is responsible to send bit 5:3 for the MROM register.

RD0PTR2(5:0) - Indicates register address for operand 2 of decode unit 0. The MROM is responsible to send bit 5:3 for the MROM register.

RD1PTR2(5:0) - Indicates register address for operand 2 of decode unit 1. The MROM is responsible to send bit 5:3 for the MROM register.

RD2PTR2(5:0) - Indicates register address for operand 2 of decode unit 2. The MROM is responsible to send bit 5:3 for the MROM register.

RD3PTR2(5:0) - Indicates register address for operand 2 of decode unit 3. The MROM is responsible to send bit 5:3 for the MROM register.

IDXDAT(1:0) - Output indicates the data size information. 01-byte, 10-half word, 11-word, 00-not used.

ICBTAG1(3:0) - Output to Idecode indicates the position of the first target branch instruction with respect to the global shift register in case of branch mis-prediction. The branch can be taken or non-taken, branch tag must be sent with all branch instructions.

ICBTAG2(3:0) - Output to Idecode indicates the position of the second target branch instruction with respect to the global shift register in case of branch mis-prediction. The branch can be taken or non-taken, branch tag must be sent with all branch instructions.

UNJMP(3:0) - Output indicates the unconditional branch instruction needs to calculate target address.

BRNTKN(3:0) - Output indicates which decode unit has a predicted taken branch. The operand steering uses this signal to latch and send BTADDR(31:0) to the functional units.

BRNINST(3:0) - Output indicates which decode unit has a global branch prediction. The operand steering uses this signal to latch and send ICBTAG1(3:0) and ICBTAG2(3:0) to the functional units.

CALLDEC(3:0) - Output to FIROB indicates the CALL instruction is detected, the return stack should be updated with the PC address of instruction after CALL. The information is latched for mis-predicted CALL branch.

RETDEC(3:0) - Output to FIROB indicates a RETURN instruction is detected. The information is latched for mis-predicted RETURN branch.
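The flag-type patterns (xx1, x1x, 1xx) and the IDXDAT size encoding listed above can be read as simple bit tests. The sketch below is illustrative only: it assumes the three patterns refer to the low-order three bits of the 5-bit flag fields (the text does not say what the upper two bits carry), and the function names are hypothetical.

```python
# Flag groups, keyed by the bit pattern given in the signal list.
FLAG_GROUPS = (
    (0b001, "CF"),            # xx1: carry flag
    (0b010, "OF"),            # x1x: overflow flag
    (0b100, "SF/ZF/PF/AF"),   # 1xx: sign, zero, parity, auxiliary carry
)

# IDXDAT(1:0) encoding, as listed in the text.
DATA_SIZE = {0b01: "byte", 0b10: "half word", 0b11: "word", 0b00: "not used"}


def decode_flag_groups(value: int):
    """Return the flag groups selected by the low three bits of a
    USEFL/WRFL value. Assumption: upper bits are ignored here."""
    return [name for mask, name in FLAG_GROUPS if value & mask]


def decode_data_size(value: int) -> str:
    """Map the two IDXDAT bits to the operand size named in the text."""
    return DATA_SIZE[value & 0b11]
```

For example, a value whose low bits are 101 would select the carry flag together with the SF/ZF/PF/AF group.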

ICPREF(9:0) - Output to Idecode and MROM indicates the encoded prefix byte. The two most significant bits are repeat prefixes for MROM.

IC2ROPC(3:0) - Output to decode unit 0 indicates 2-rop instruction. Bit 3 indicates the first rop or second rop of the 2-rop instruction, bit 2 indicates POP instruction, bit 1 indicates the MUL instruction, and bit 0 indicates the SIB-byte instruction.

NODEST(3:0) - Output to FIROB indicates no destination for the first rop of the SIB-byte instruction.

DEPTAG(3:1) - Output to FIROB indicates forced dependency tag on the first instruction; the second rop of the SIB-byte instruction.

REFRESH2 - Input from Idecode indicates current line of instructions will be refreshed and not accept new instructions from Icache.

IB1(191:0) - Output indicates the combined instruction line for dispatching to decode units.

MROMEN - Input from MENG indicates the micro-instructions is sent to Idecode instead of the Icache.

M0USEFL(4:0)
M0WRFL(4:0) - Input from MENG indicates the type of flag used/written for this micro-instruction of decode unit 0:

xx1 CF-carry flag,

x1x OF-overflow flag,

1xx SF-sign, ZF-zero, PF-parity, and AF-auxiliary carry

M1USEFL(4:0)
M1WRFL(4:0) - Input from MENG indicates the type of flag used/written for this micro-instruction of decode unit 1.

M2USEFL(4:0)
M2WRFL(4:0) - Input from MENG indicates the type of flag used/written for this micro-instruction of decode unit 2.

M3USEFL(4:0)
M3WRFL(4:0) - Input from MENG indicates the type of flag used/written for this micro-instruction of decode unit 3.

MINS0(63:0) - Input from MENG indicates the displacement and immediate field of micro-instruction being sent to decode 0.

MINS1(63:0) - Input from MENG indicates the displacement and immediate field of micro-instruction being sent to decode 1.

MINS2(63:0) - Input from MENG indicates the displacement and immediate field of micro-instruction being sent to decode 2.

MINS3(63:0) - Input from MENG indicates the displacement and immediate field of micro-instruction being sent to decode 3.

MR0OPC(7:0) - Input from MENG to decode unit 0 indicates the opcode byte.

MR1OPC(7:0) - Input from MENG to decode unit 1 indicates the opcode byte.

MR2OPC(7:0) - Input from MENG to decode unit 2 indicates the opcode byte.

MR3OPC(7:0) - Input from MENG to decode unit 3 indicates the opcode byte.

MR0EOP(2:0) - Input from MENG to decode unit 0 indicates the extended opcode field.

MR1EOP(2:0) - Input from MENG to decode unit 1 indicates the extended opcode field.

MR2EOP(2:0) - Input from MENG to decode unit 2 indicates the extended opcode field.

MR3EOP(2:0) - Input from MENG to decode unit 3 indicates the extended opcode field.
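The IC2ROPC(3:0) field in this signal list packs four one-bit indications into a single value. A minimal sketch of unpacking it follows. The type and function names are hypothetical, and because the text does not state which value of bit 3 denotes the first rop and which the second, that bit is returned uninterpreted.

```python
from typing import NamedTuple


class TwoRopInfo(NamedTuple):
    rop_select: int  # bit 3: first vs. second rop; polarity not stated in the text
    is_pop: bool     # bit 2: POP instruction
    is_mul: bool     # bit 1: MUL instruction
    is_sib: bool     # bit 0: SIB-byte instruction


def decode_ic2ropc(value: int) -> TwoRopInfo:
    """Unpack the four IC2ROPC bits as described in the signal list."""
    return TwoRopInfo(
        rop_select=(value >> 3) & 1,
        is_pop=bool(value & 0b0100),
        is_mul=bool(value & 0b0010),
        is_sib=bool(value & 0b0001),
    )
```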

|    | Document<br>ID      | U | Title                                                                                                                                                  | Current<br>OR |
|----|---------------------|---|--------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|
| 41 | US<br>58483<br>73 A |   | Computer aided map location system                                                                                                                     | 701/200       |
| 42 | US<br>58420<br>31 A |   | Advanced parallel array processor (APAP)                                                                                                               | 712/23        |
| 43 | US<br>58094<br>50 A |   | Method for estimating statistics of properties of instructions processed by a processor pipeline                                                       | 702/186       |
| 44 | US<br>57940<br>59 A |   | N-dimensional modified hypercube                                                                                                                       | 712/10        |
| 45 | US<br>57650<br>11 A |   | Parallel processing system having a synchronous SIMD processing with processing elements emulating SIMD operation using individual instruction streams | 712/20        |
| 46 | US<br>57615<br>23 A |   | Parallel processing system having asynchronous SIMD processing and data parallel coding                                                                | 712/20        |
| 47 | US<br>57548<br>71 A |   | Parallel processing system having asynchronous SIMD processing                                                                                         | 712/20        |
| 48 | US<br>57520<br>67 A |   | Fully scalable parallel processing system having asynchronous<br>SIMD processing                                                                       | 712/16        |
| 49 | US<br>57349<br>21 A |   | Advanced parallel array processor computer package                                                                                                     | 712/10        |
| 50 | US<br>57179<br>44 A |   | Autonomous SIMD/MIMD processor memory elements                                                                                                         | 712/20        |
| 51 | US<br>57179<br>43 A |   | Advanced parallel array processor (APAP)                                                                                                               | 712/20        |
| 52 | US<br>57130<br>37 A |   | Slide bus communication functions for SIMD/MIMD array processor                                                                                        | 702/33        |
| 53 | US<br>57109<br>35 A |   | Advanced parallel array processor (APAP)                                                                                                               | 712/20        |
| 54 | US<br>57088<br>36 A |   | SIMD/MIMD inter-processor communication                                                                                                                | 712/20        |
| 55 | US<br>56258<br>36 A |   | SIMD/MIMD processing memory element (PME)                                                                                                              | 709/214       |
| 56 | US<br>55903<br>45 A |   | Advanced parallel array processor(APAP)                                                                                                                | 712/11        |
| 57 | US<br>55881<br>52 A |   | Advanced parallel processor including advanced support hardware                                                                                        | 712/16        |
| 58 | US<br>43251<br>20 A |   | Data processing system                                                                                                                                 | 711/202       |

Decoding of all opcodes is needed to detect immediate 8-byte block, and bit 4:3 indicates which 8-byte block. displacement pointer and size. Bits 2:0 is the pointer to the IMMPTR2(4:0)—Output to decode unit 2 indicates the

Decoding of all opcodes is needed to detect immediate 8-byte block, and bit 4:3 indicates which 8-byte block. displacement pointer and size. Bits 2:0 is the pointer to the IMMPTR3(4:0)—Output to decode unit 3 indicates the

stant for add/substract to ESP of the two-dispatch position CONSTn(2:0)—Output to decode unit n indicates the confield.

instructions is sent to Idecode instead of the Icache. MROMEM—Input from MEMG indicates the microinstruction.

line of instructions. IB2(191:0)—Output to decode units indicates the current

cycles to read the IB, ICEND, and ICFUNC. tion is MROM. The MROM instruction may take two ICMROM—Output to MENG indicates the current instruc-

branch instruction. target of a previous instruction which is a predicted taken ICPCITAR—Output to Idecode indicates is ICPCI a branch

branch instruction. target of a previous instruction which is a predicted taken ICPC2TAR—Output to Idecode indicates is ICPC2 a branch

pass along with the instruction to FIROB. PC of the first instruction in the 4 issued instructions to ICPCI(31:0)—Output to Idecode indicates the current line

ICPC2(31:0)—Output to Idecode indicates the current line PC of a second instruction which crosses the 16-byte boundary or branch target in the 4 issued instructions to pass along with the instruction to FIROB.

ICPOS0(4:0)—Output to decode unit 0 indicates the PC's byte position of the next instruction. Bit 4 indicates the next instruction is on the next line.

ICPOS1(4:0)—Output to decode unit 1 indicates the PC's byte position of the next instruction. Bit 4 indicates the next instruction is on the next line.

ICPOS2(4:0)—Output to decode unit 2 indicates the PC's byte position of the next instruction. Bit 4 indicates the next instruction is on the next line.

ICPOS3(4:0)—Output to decode unit 3 indicates the PC's byte position of the next instruction. Bit 4 indicates the next instruction is on the next line.

BTAG1N(3:0)—Output indicates the position of the first target branch instruction for a new line with respect to the global shift register in case of branch mis-prediction.

BTAG2N(3:0)—Output indicates the position of the second target branch instruction for a new line with respect to the global shift register in case of branch mis-prediction.

BTAKEN1(1:0)—Output to decode units and ICFPC indicates a predicted taken branch instruction from PTAKEN, BVAL1. Bit 0 is the last line and bit 1 is new line.

BTAKEN2(1:0)—Output to decode units and ICFPC indicates a predicted taken branch instruction from PTAKEN, BVAL2. Bit 0 is the last line and bit 1 is new line.

ICERROR—Output, indicates an exception has occurred on an instruction pre-fetched, the type of exception (TLB-miss, page-fault, illegal opcode, external bus error) will also be asserted.

INSPFET—Output to BIU and CMASTER requests instruction fetching from the previous incremented address, the pre-fetch buffer in the Icache has space for a new line from external memory.

ICAD(31:0)—Output to MMU indicates a new fetch PC request to external memory.

MR0SS(1:0)—Input from MENG to decode unit 0 indicates the scale factor of the SIB byte.

MR1SS(1:0)—Input from MENG to decode unit 1 indicates the scale factor of the SIB byte.

MR2SS(1:0)—Input from MENG to decode unit 2 indicates the scale factor of the SIB byte.

MR3SS(1:0)—Input from MENG to decode unit 3 indicates the scale factor of the SIB byte.

ICMROM—Output to MENG indicates the current instruction is MROM. The MROM instruction may take two cycles to read the IB, ICEND, and ICFUNC.

ENDINST—Input from ICPRED indicates that pre-decoding is completed for the current instruction. The byte position of the branch instruction is from STARTFIR. The selected instruction from IB should be sent to decode unit 0.

ICVALI(3:0)—Output to Idecode indicates valid instructions. NOOP is generated for invalid instruction.

IC0OPC(7:0)—Output to decode unit 0 indicates the opcode byte.

IC1OPC(7:0)—Output to decode unit 1 indicates the opcode byte.

IC2OPC(7:0)—Output to decode unit 2 indicates the opcode byte.

IC3OPC(7:0)—Output to decode unit 3 indicates the opcode byte.

IC0EOP(2:0)—Output to decode unit 0 indicates the extended opcode field.

IC1EOP(2:0)—Output to decode unit 1 indicates the extended opcode field.

IC2EOP(2:0)—Output to decode unit 2 indicates the extended opcode field.

IC3EOP(2:0)—Output to decode unit 3 indicates the extended opcode field.

IC0SS(1:0)—Output to decode unit 0 indicates the scale factor of the SIB byte.

IC1SS(1:0)—Output to decode unit 1 indicates the scale factor of the SIB byte.

IC2SS(1:0)—Output to decode unit 2 indicates the scale factor of the SIB byte.

IC3SS(1:0)—Output to decode unit 3 indicates the scale factor of the SIB byte.

DISPTR0(6:0)—Output to decode unit 0 indicates the displacement pointer and size. Bits 2:0 is the pointer to the 8-byte block, bit 6:5 is the size, and bit 4:3 indicates which 8-byte block. Bit 6:5=00 indicates no displacement.

DISPTR1(6:0)—Output to decode unit 1 indicates the displacement pointer and size. Bits 2:0 is the pointer to the 8-byte block, bit 6:5 is the size, and bit 4:3 indicates which 8-byte block. Bit 6:5=00 indicates no displacement.

DISPTR2(6:0)—Output to decode unit 2 indicates the displacement pointer and size. Bits 2:0 is the pointer to the 8-byte block, bit 6:5 is the size, and bit 4:3 indicates which 8-byte block. Bit 6:5=00 indicates no displacement.

DISPTR3(6:0)—Output to decode unit 3 indicates the displacement pointer and size. Bits 2:0 is the pointer to the 8-byte block, bit 6:5 is the size, and bit 4:3 indicates which 8-byte block. Bit 6:5=00 indicates no displacement.

IMMPTR0(4:0)—Output to decode unit 0 indicates the displacement pointer and size. Bits 2:0 is the pointer to the 8-byte block, and bit 4:3 indicates which 8-byte block. Decoding of all opcodes is needed to detect immediate field.
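As an illustration only, the bit layouts given above for the DISPTRn(6:0) and ICPOSn(4:0) signals can be unpacked in software as in the following minimal sketch. The function and type names are hypothetical and are not part of the described hardware. The sketch assumes that bits 3:0 of ICPOSn carry the byte position within the line, which the text implies but does not state, and it leaves the size field as its raw two-bit code because the text does not give the encoding of the non-zero values.

```python
from typing import NamedTuple, Optional, Tuple


class DispPtr(NamedTuple):
    """Unpacked DISPTRn(6:0) fields (hypothetical helper type)."""
    pointer: int  # bits 2:0, pointer into the 8-byte block
    block: int    # bits 4:3, which 8-byte block
    size: int     # bits 6:5, raw size code (never 0 here)


def decode_disptr(value: int) -> Optional[DispPtr]:
    """Split a 7-bit DISPTRn value; return None when bits 6:5 are 00."""
    if not 0 <= value < (1 << 7):
        raise ValueError("DISPTRn is a 7-bit field")
    size = (value >> 5) & 0b11
    if size == 0:
        # Bit 6:5 = 00 indicates no displacement.
        return None
    return DispPtr(pointer=value & 0b111, block=(value >> 3) & 0b11, size=size)


def decode_icpos(value: int) -> Tuple[int, bool]:
    """Split a 5-bit ICPOSn value into (byte position, on-next-line flag).

    Assumption: bits 3:0 are the byte position within the line.
    """
    if not 0 <= value < (1 << 5):
        raise ValueError("ICPOSn is a 5-bit field")
    return value & 0b1111, bool((value >> 4) & 1)
```

For example, `decode_disptr(0b0101101)` yields pointer 5, block 1, and size code 1, while any value whose bits 6:5 are 00 yields `None`.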

IMMPTR1(4:0)—Output to decode unit 1 indicates the displacement pointer and size. Bits 2:0 is the pointer to the 8-byte block, and bit 4:3 indicates which 8-byte block. Decoding of all opcodes is needed to detect immediate